I'm noticing some weird behavior in a search that requires me to inline some regexes in order to get the MR job to work.
Here are the relevant contents of $HUNK_HOME/etc/apps/{non_searchapp_app}/local/props.conf:
[myvix_sourcetype]
EXTRACT-myField = ^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)
Example Search: (Smart Mode)
index=myvix source=*events*
Example Search: (Smart Mode)
index=myvix source=*events* | table _time, my_field
I get the following results:
_time my_field
2015-05-26 16:19:57
2015-05-26 16:19:57
...
If I inline the rex instead of relying on the field extraction in props.conf:
index=myvix source=*events* | rex field=message "^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)" | table _time, my_field
results in the following:
_time my_field
2015-05-26 16:19:57 my_field_value-A
2015-05-26 16:19:57 my_field_value-B
Inlining the same regex against _raw (i.e. field=_raw) **_does not work**!!!
index=myvix source=*events* | rex field=_raw "^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)" | table _time, my_field, _raw
results:
_time my_field _raw
2015-05-26 16:19:57 {"header": {"time": 1432675197252, "threadId": "qtpXXXX", "requestMarker": "abadbeef42c8", "env": "production", "server": "some-prod-server", "service": "some-service"}}
2015-05-26 16:19:57 {"header": {"time": 1432675197253, "threadId": "qtpYYYY", "requestMarker": "8badbeef9139", "env": "production", "server": "some-otherprod-server", "service": "some-other-service"}}
Notice that _raw doesn't work because the 'message' field of the _raw avro record is not being included. Only the 'header' field is being included.
FWIW, the regex was generated using the "Event Action -> Extract Fields" UI from the main search view.
And as one last attempt to self-serve and figure this out, I added message to the table command, and it works!! Go figure.
index=myvix source=*events* | rex field=_raw "^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)" | table _time, my_field, _raw, message
results:
_time my_field _raw message
2015-05-26 16:19:57 my_field_value-A {"header": {"time": 1432675197252, "threadId": "qtpXXXX", "requestMarker": "abadbeef42c8", "env": "production", "server": "some-prod-server", "service": "some-service"}, "message": "t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-A|xxxx|x|x|blah&blah&blah|xxx/xxx|x|x|"} t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-A|xxxx|x|x|blah&blah&blah|xxx/xxx|x|x|
2015-05-26 16:19:57 my_field_value-B {"header": {"time": 1432675197253, "threadId": "qtpYYYY", "requestMarker": "8badbeef9139", "env": "production", "server": "some-otherprod-server", "service": "some-other-service"}, "message": "t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-B|xxxx|x|x|blah&blah&blah|xxx/xxx|x|x|"} t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-B|xxxx|x|x|blah&blah&blah|xxx/xxx|x|x|
So it seems I have to tell Hunk ahead of time which "raw fields" to include, and then it will "auto extract"?
Ahh, the "corollary" and "corollary++" are actually very important to what you're experiencing. Basically, Hunk has no knowledge that the field is being extracted from the "message" field, so the Avro reader doesn't output it, and thus the extraction fails. Why does it work when you run "index=vix source=events"? Well, if you're not running a reporting search (stats, timechart, etc.), the search is effectively run in "verbose mode".
There are two ways to fix this:
a) if there are some fields that you always need, you can tell the record readers to always output them - check this answer for how to do that
b) you can tell the extractor that the field is actually being extracted from another field by modifying the extraction rule as follows:
[myvix_sourcetype]
EXTRACT-myField = ^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+) IN message
Unfortunately both methods require you to edit .conf files.
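To make the explanation above concrete, here is a minimal Python sketch of the behavior. It is a toy model, not Hunk's actual implementation: `read_record`, `extract`, and the sample `RECORD` are hypothetical names invented for illustration, and Python spells named groups as `(?P<name>...)` where props.conf uses `(?<name>...)`.

```python
import json
import re

# Same pattern as the EXTRACT rule, with Python's (?P<name>...) group syntax.
PATTERN = re.compile(r"^(?:[^\|\n]*\|){6}(?P<my_field>[^\|]+)")

# Hypothetical record shaped like the anonymized events in the question.
RECORD = {
    "header": {"time": 1432675197252, "service": "some-service"},
    "message": "t.blah.X.blah.blah.blah - |x|xxx|xxx|xxxx|xxx-xxxx|my_field_value-A|xxxx|x|x|",
}


def read_record(record, required_fields):
    """Toy stand-in for a record reader: emit only the requested top-level fields."""
    return {key: value for key, value in record.items() if key in required_fields}


def extract(event, source_field=None):
    """Run the regex on source_field if given, else on the serialized event (like _raw)."""
    text = event.get(source_field, "") if source_field else json.dumps(event)
    match = PATTERN.search(text)
    return match.group("my_field") if match else None


# The reader was never told that "message" is needed, so only "header" comes out.
header_only = read_record(RECORD, {"header"})
# "message" is known to be required (always output, or declared via "in message").
with_message = read_record(RECORD, {"header", "message"})

no_message = extract(header_only)                # nothing to match against
from_raw = extract(with_message)                 # regex over the whole serialized event
from_message = extract(with_message, "message")  # regex over the message field only

print(no_message, from_raw, from_message)
```

In this model the regex itself never changes. The only thing that differs between the failing and the working case is whether the reader emitted `message`, which is what both fix a) and fix b) arrange.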
Since the original developer used the UI to create the regex, it would be great if the UI could infer that message is required. It severely limits what end users can do for "schema-on-read" use-cases.... requiring a ticket for each field-extraction for the admin to go in and edit.
I tried both approaches and both worked, as advertised.
Since this is specific to the {non_searchapp_app} and since I only need it to pull in that field when it needs to I went with b).
[myvix_sourcetype]
EXTRACT-myField = ^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+) in message
It worked like a charm! Thanks @ledion once again!
We're already tracking a similar enhancement request internally; for your reference, it's SPL-94381.
In the props.conf do you have your HDFS directory?
[source::/user/hunk/data/England/...]
sourcetype = England
EXTRACT-myField = XYZ
In $HUNK_HOME/etc/system/local/props.conf (note: that's system/local, not apps/{non_searchapp_app}/local):
[myvix_sourcetype]
EVAL-_time = strptime('header.time', "%s%3N")
TRUNCATE = 102400
MAX_TIMESTAMP_LOOKAHEAD = 30
[source::/user/hunkuser/data/...]
sourcetype = myvix_sourcetype
In $HUNK_HOME/etc/apps/{non_searchapp_app}/local/props.conf:
[myvix_sourcetype]
EXTRACT-myField = ^(?:[^\|\n]*\|){6}(?<my_field>[^\|]+)
I'd also recommend revising the time extraction rule based on this answer - eval based timestamp extraction causes time based partition pruning to be disabled.
@ledion thanks for pointing that out. I had actually read that answer and always focused on the RHS (e.g. the "%s%3N") and not the LHS (e.g. EXTRACT-_time vs EVAL-_time). I'll investigate and report back.
@Ledion, going with this:
[myvix_sourcetype]
#EVAL-_time = strptime('header.time', "%s%3N")
#EXTRACT-_time = strptime('header.time', "%s%3N")
TRUNCATE = 102400
TIME_PREFIX = "time":[ ]
TIME_FORMAT = %3N
MAX_TIMESTAMP_LOOKAHEAD = 40
Two more things:
a) make sure to add header.time in the required fields for the vix
b) you'd need to fix TIME_FORMAT, probably need "%s%3N" (or maybe that's what you have and it doesn't render right here)
Yup.
- Already had header.time as required fields for the vix.
- Missed the %s... added it
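For anyone wondering why the %s matters: header.time in the sample events is a 13-digit number, which looks like epoch milliseconds, so the format needs %s for the seconds part and %3N for the trailing three digits. Below is a small Python sketch of that interpretation. `parse_epoch_millis` is a hypothetical helper written for illustration, not anything Splunk or Hunk provides.

```python
from datetime import datetime, timezone


def parse_epoch_millis(value):
    """Interpret an integer such as header.time as milliseconds since the Unix epoch (UTC)."""
    seconds, millis = divmod(int(value), 1000)
    # Build the timestamp from whole seconds, then attach the millisecond remainder.
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(microsecond=millis * 1000)


# header.time from the first sample event in the question.
ts = parse_epoch_millis(1432675197252)
print(ts.isoformat())
```

The result is 21:19:57.252 UTC on 2015-05-26, which would line up with the 16:19:57 shown in the search results if the times there are displayed in a UTC-5 zone.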