Why does regex to extract host from path work in regex101, while it's not working in Splunk?

mlevsh
Builder

I need to extract the hostname from the file path in a directory-monitoring data input.

Path: /export/var/path/host1.log                    ->  Host: host1
Path: /export/var/path/host-02.ac.lp.our.domain.log ->  Host: host-02
Path: /export/var/path/host3.ac.lp.our.domain.log   ->  Host: host3

I tried three different regexes. All of them work on regex101.com, but when I use them as host_regex in inputs.conf, only the extraction for host1.log works.
After the data is ingested, the other two hosts are set to "host-02.ac.lp.our.domain" and "host3.ac.lp.our.domain" instead of host-02 and host3.

1) \/export\/var\/path\/(.*?[^.]+)
https://regex101.com/r/hu4Wax/1

inputs.conf:

[monitor:///export/var/path/*.log]
disabled = false
host_regex = \/export\/var\/path\/(.*?[^\.]+)
index = default
sourcetype = default

Splunk sets host to host1, host-02.ac.lp.our.domain and host3.ac.lp.our.domain. The goal was host1, host-02 and host3.

2) ^\/\w+\/\w+\/\w+\/?([^.]+)
https://regex101.com/r/jTeVML/1

inputs.conf:

[monitor:///export/var/path/*.log]
disabled = false
host_regex = ^\/\w+\/\w+\/\w+\/?([^\.]+)
index = default
sourcetype = default

Splunk sets host to host1 & host-02.ac.lp.our.domain & host3.ac.lp.our.domain

3) \/export\/var\/path\/(.+?)\..*log
https://regex101.com/r/yUJY9j/1/

inputs.conf:

[monitor:///export/var/path/*.log]
disabled = false
host_regex = \/export\/var\/path\/(.+?)\..*log
index = default
sourcetype = default

Splunk sets host to host1 & host-02.ac.lp.our.domain & host3.ac.lp.our.domain

Will appreciate any advice!
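
For readers who want to reproduce the regex101 result offline, here is a minimal sketch that runs the three candidate patterns against the three sample paths. It assumes Python's `re` module behaves like Splunk's PCRE engine for patterns this simple; the helper names `extract_host` and `check_all` are made up for illustration.

```python
import re

# Sample paths and the hosts the poster expects (taken from the question).
EXPECTED = {
    "/export/var/path/host1.log": "host1",
    "/export/var/path/host-02.ac.lp.our.domain.log": "host-02",
    "/export/var/path/host3.ac.lp.our.domain.log": "host3",
}

# The three host_regex candidates from the question.
PATTERNS = [
    r"\/export\/var\/path\/(.*?[^.]+)",
    r"^\/\w+\/\w+\/\w+\/?([^\.]+)",
    r"\/export\/var\/path\/(.+?)\..*log",
]


def extract_host(pattern, path):
    """Return the first capture group, which is what host_regex uses as the host."""
    match = re.search(pattern, path)
    return match.group(1) if match else None


def check_all():
    """Map each (pattern, path) pair to the host that the pattern extracts."""
    return {(p, path): extract_host(p, path) for p in PATTERNS for path in EXPECTED}


if __name__ == "__main__":
    for (pattern, path), host in check_all().items():
        status = "ok" if host == EXPECTED[path] else "MISMATCH"
        print(f"{status:8} {pattern!r:45} {path} -> {host}")
```

If every pair reports the expected host here as well, that suggests the patterns themselves are sound and the unexpected value is introduced somewhere after the input stage.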


micheldejong
Explorer

You can check if the regex works with:

| makeresults 
| eval Path="/export/var/path/host-04.ac.lp.our.domain.log"
| rex field=Path ".+\/(?<host>\-?[^.]+).*" 
| table Path host

Just adjust the path in the "| eval Path=..." line to check what this regex matches.
This is your working host_regex (with the capture-group name "?<host>" removed):

host_regex = .+\/(\-?[^.]+).*
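
The same check can be sketched outside Splunk. This is only an illustration, assuming Python's `re` matches these constructs the way PCRE does; `host_from_path` is a made-up helper, and the third sample path is a hypothetical deeper directory that does not appear in the question.

```python
import re

# Regex from the reply above, with the unnamed capture group used in host_regex.
HOST_REGEX = r".+\/(\-?[^.]+).*"


def host_from_path(path):
    """Return group 1 of HOST_REGEX, or None if the path does not match."""
    match = re.match(HOST_REGEX, path)
    return match.group(1) if match else None


# The last path is a hypothetical deeper directory, added for illustration only.
SAMPLES = [
    "/export/var/path/host1.log",
    "/export/var/path/host-04.ac.lp.our.domain.log",
    "/export/var/other/deeper/path/host3.ac.lp.our.domain.log",
]

if __name__ == "__main__":
    for path in SAMPLES:
        print(path, "->", host_from_path(path))
```

Because the leading `.+\/` is greedy, it consumes everything up to the last slash, so the pattern does not depend on how many directories precede the file name.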

FrankVl
Ultra Champion

The issue was never with the regex. His host field was being overwritten by a transform defined in one of the config files under etc/system. See https://answers.splunk.com/comments/710989/view.html


woodcock
Esteemed Legend

Your regex #1 should be fine, but use this instead:

host_regex = ^(?:\/\w+){3}\/([^\.]+)

The problem is that you are not evaluating your changes correctly. Are you restarting the Splunk forwarder instance after each change? If so, then you are probably not timestamping your events correctly and are throwing events into the future. When you think you are evaluating the effect of your most recent change, you are actually looking at events that were processed under a previous change but have only just trickled from the future into the present.

Put in this change and run your search with the "All time" time picker and with these arguments added to the base search, to make sure that you are really seeing events that were indexed recently:

... _index_earliest=-5m _index_latest=@m

mlevsh
Builder

@woodcock, host_regex in the data input works, but only for logs like host1.log.
When I search the data, host is set to "host1" after ingesting host1.log.
But if the log's file name contains a domain name, like host-02.ac.lp.our.domain.log,
then host is set to "host-02.ac.lp.our.domain" instead of "host-02".

I don't think this issue is related to time stamping.


woodcock
Esteemed Legend

Do you really understand what I am saying? The regex is fine. The way you are evaluating its efficacy must be flawed. I stand by this statement. Re-read what I said and use the search parameters that I gave you. The problem is NOT the host_regex line.


mlevsh
Builder

@woodcock, sorry for the delay. Hopefully you will see my reply.
I understood what you said; I don't have to re-read it.
Since I was testing in our development environment first, I made sure I wasn't looking at previously ingested data.
I deleted the index, created a new one, and used the web GUI: Add Data -> Index Once -> used the current time as the timestamp -> used a regular expression on the path.
The host was still extracted as expected only for the "host.log" format, not for the "host.ac.lp.domain.name.log" format.


woodcock
Esteemed Legend

There is nothing more that I can do. Something is not as it seems. You should open a support case and report back what you eventually find.


mlevsh
Builder

@woodcock ,
The sourcetype I was using had [syslog-host] in transforms.conf, which was overriding the host_regex in my data input. (I had selected the existing sourcetype "syslog", made some modifications to it, and saved it under a different name.)

Now I have a new challenge: how to extract the host with my host_regex without making any changes to the sourcetype (for a number of reasons).


woodcock
Esteemed Legend

You only get one pass through the parsing queue, and if you are using the syslog sourcetype (which I strongly discourage for exactly this reason), then that is the problem. Copy the syslog settings that you need into your own sourcetype and work from there.


FrankVl
Ultra Champion

Did you try the adjusted regex woodcock suggested? That .*? part in your original regex is not needed and might cause some funky behavior (some regex libraries are more equal than others).

Another possibility is that some hostname override is happening. Is this syslog-like data, with the hostname also near the start of the log message? By using a default sourcetype you may very well get a syslog-host extraction for free, defined in system/default/props.conf, which bluntly overwrites whatever you do in inputs.conf.
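
To make the override scenario concrete, here is a rough simulation, not Splunk code. It assumes a much simplified stand-in for a syslog-style host transform (the real [syslog-host] REGEX in transforms.conf is more involved), and the sample events and function names are hypothetical.

```python
import re

# host_regex as it might appear in inputs.conf, applied to the source path.
HOST_REGEX = r"^(?:\/\w+){3}\/([^\.]+)"

# Hypothetical, simplified stand-in for a syslog-style host transform:
# capture the token that follows an "HH:MM:SS " timestamp in the raw event.
SYSLOG_HOST_REGEX = r"\d\d:\d\d:\d\d\s+([\w.\-]+)\s"


def host_at_input(path):
    """Host assigned by the input stanza (host_regex applied to the path)."""
    match = re.search(HOST_REGEX, path)
    return match.group(1) if match else None


def host_after_parsing(path, raw_event, apply_syslog_transform):
    """Host after parsing; a matching transform replaces the input-time host."""
    host = host_at_input(path)
    if apply_syslog_transform:
        match = re.search(SYSLOG_HOST_REGEX, raw_event)
        if match:
            host = match.group(1)
    return host


if __name__ == "__main__":
    path = "/export/var/path/host-02.ac.lp.our.domain.log"
    # Hypothetical syslog-like event whose header carries the FQDN.
    event = "Jan 10 12:34:56 host-02.ac.lp.our.domain sshd[123]: session opened"
    print(host_after_parsing(path, event, apply_syslog_transform=False))
    print(host_after_parsing(path, event, apply_syslog_transform=True))
```

Under these assumptions the path-derived host is correct but gets replaced by whatever name the event header carries. If the events in host1.log happen to carry the short name, the override would go unnoticed for that file, which would fit the symptom described in the question.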


mlevsh
Builder

@FrankVl , hopefully you will see my comment.
You were right in your suggestion!

The sourcetype I was using had [syslog-host] in transforms.conf, which was overriding the host_regex in my data input. (I had selected the existing sourcetype "syslog", made some modifications to it, and saved it under a different name.)

Now I have a new challenge: how to extract the host with my host_regex without making any changes to the sourcetype.


saurabhkharkar
Path Finder

Try This

| makeresults
| eval Path="/export/var/path/host3.ac.lp.our.domain.log"
| rex field=Path ".+\/(?<host>[^.]+).*"
| table Path host


mlevsh
Builder

@saurabhkharkar, I'm trying to extract the host from the log file name into the "host" field at input time, via host_regex in the monitor input, not at search time.
