We have several servers that support an HTTP request that presents a page of activity in a simple HTML table. I'd like to GET the page, find the table, use the TH values as Splunk field names and the TD values as the field values, with one Splunk event per table row.
The Website Input app seems like the right tool for this, but I can't quite get it working.
I defined the input to produce a preview like this:
| webscrape selector="#myTable tr" url="http://server1.ourcompany.com/cgi-bin/reportStatus.cgi" depth_limit=25 text_separator="|" empty_matches=0
And I use a sourcetype of, say, "server_status".
The results include the HTML table, such as:
<table id="myTable" border="1">
<tbody>
<tr>
<th>Seq</th>
<th>Time</th>
<th>Source</th>
<th>Event</th>
</tr>
<tr style="background: rgb(224, 224, 224);">
<td>#1</td>
<td>2012_02_07 03:02:08</td>
<td>some-process_log</td>
<td>file received</td>
</tr>
<tr>...</tr>
</table>
I'm aiming for Splunk events that look like:
Seq=#1
Time=2012_02_07 03:02:08
Source=some-process_log
Event=file received
What I see is:
|39||#45||2013_08_09 22:20||some processing source||Sending confirmation email|
This all seems like I'm going down the wrong rabbit hole. Can someone advise the right way to get this done?
Thanks!
Tables are kind of tricky.
I do think that parsing the data in Splunk's search language is the way to go.
1. Use the webscrape search command plus other search commands to index parsed data
You could use the webscrape search command along with a few other search commands to retrieve the data and shape it into the format you want. Then you could use the collect command to persist it, fully parsed, to the index you want.
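A sketch of what that pipeline might look like. The summary index name (server_status_summary) and the rex pattern are my own guesses here, so adjust both to your data:

```spl
| webscrape selector="#myTable tr" url="http://server1.ourcompany.com/cgi-bin/reportStatus.cgi" depth_limit=25 text_separator="|" empty_matches=0
| mvexpand match
| rex field=match "^\|?(?<Seq>[^|]+)\|+(?<Time>[^|]+)\|+(?<Source>[^|]+)\|+(?<Event>[^|]+)"
| where Seq!="Seq"
| table Seq Time Source Event
| collect index=server_status_summary sourcetype=server_status
```

The `where` clause drops the header row (whose first cell is the literal text "Seq") so only data rows get collected.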
2. Store the data in a separate index, then use a search to parse and reindex that data into the final index
Another approach would be to create an input that retrieves the data and stores it in a separate index, then use a scheduled search to parse those results and move the parsed results into the index where you want the final data.
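For example, if the raw webscrape output were landing in a hypothetical index called web_raw, the scheduled search could look roughly like this (again, the rex pattern is a guess you'd tune to your actual match values):

```spl
index=web_raw sourcetype=server_status
| rex field=match "^\|?(?<Seq>[^|]+)\|+(?<Time>[^|]+)\|+(?<Source>[^|]+)\|+(?<Event>[^|]+)"
| where isnotnull(Seq) AND Seq!="Seq"
| table _time Seq Time Source Event
| collect index=server_status
```

The advantage is that the raw scrape stays around for reprocessing if the parsing logic ever needs to change.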
3. Use a search to parse the results directly
Finally, you could just use the UI to create the input and then run a few search commands each time you review the data to parse out the matches (as you are currently doing).
Thanks for the feedback. I admit to not following all of your suggestions. I fixed up my #1 above and added a #4, which is background to the approach below, in which I split the match field and assign fields manually. (I was obviously hoping there was some magical way of having the TH tags become, in essence, the CSV header and the TD tags the values.) I suppose I could write a custom command in Python to do this as well, but here's what I hacked together for now. I'm not sure which of your suggestions it's closest to (if any):
| webscrape selector="#ContentTbl tr" url="http://server1.ourcompany.com/cgi-bin/reportStatus.cgi" depth_limit=25 text_separator="|" empty_matches=0
| fields match
| rex field=match mode=sed "s/\|\|/|/g"
| rex field=match mode=sed "s/^\|//"
| rex field=match mode=sed "s/\|$//"
| mvexpand match
| eval flds=split(match, "|")
| fields - match
| eval Seq=mvindex(flds,0)
| eval Time=mvindex(flds,1)
| eval Source=mvindex(flds,2)
| eval Event=mvindex(flds,3)
| fields - flds
The goal is eventually to do this for multiple servers (1, 2, 3, etc.), then Splunkify it somehow so I can report against them. For that, I expect I'd have to refetch the data on a schedule and then dedup before indexing or adding to a lookup. I'm still a Splunk newbie/lame-o, so I'm feeling my way here.
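For the record, here's a rough sketch of the per-server scheduled version I have in mind. The server tag, the summary index name (server_status_summary), and the subsearch-based dedup are all assumptions on my part, not something I've tested yet:

```spl
| webscrape selector="#ContentTbl tr" url="http://server1.ourcompany.com/cgi-bin/reportStatus.cgi" depth_limit=25 text_separator="|" empty_matches=0
| eval server="server1"
| rex field=match mode=sed "s/\|\|/|/g"
| rex field=match mode=sed "s/^\|//"
| rex field=match mode=sed "s/\|$//"
| mvexpand match
| eval flds=split(match, "|")
| eval Seq=mvindex(flds,0), Time=mvindex(flds,1), Source=mvindex(flds,2), Event=mvindex(flds,3)
| fields - match flds
| search NOT [ search index=server_status_summary | fields server Seq Time ]
| collect index=server_status_summary
```

The subsearch throws out rows whose server/Seq/Time combination is already in the summary index, which should keep repeated fetches from double-indexing the same table rows.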