We have several servers that support an HTTP request that presents a page of activity in a simple HTML table. I'd like to GET the page, find the table, use the TH values as Splunk field names and the TD values as the field values, with one Splunk event per table row.
The Website Input app seems like the right tool for this, but I can't quite get it working.
I defined the input to produce a preview like this:
| webscrape selector="#myTable tr" url="http://server1.ourcompany.com/cgi-bin/reportStatus.cgi" depth_limit=25 text_separator="|" empty_matches=0
And I use a sourcetype of, say, "server_status".
The results include the HTML table, such as:
<table id="myTable" border="1">
<tbody>
<tr>
<th>Seq</th>
<th>Time</th>
<th>Source</th>
<th>Event</th>
</tr>
<tr style="background: rgb(224, 224, 224);">
<td>#1</td>
<td>2012_02_07 03:02:08</td>
<td>some-process_log</td>
<td>file received</td>
</tr>
<tr>...</tr>
</table>
I'm aiming for Splunk events that look like:
Seq=#1
Time=2012_02_07 03:02:08
Source=some-process_log
Event=file received
What I see is:
|39||#45||2013_08_09 22:20||some processing source||Sending confirmation email|
This all seems like I'm going down the wrong rabbit hole. Can someone advise the right way to get this done?
Thanks!
Tables are kind of tricky.
I do think that parsing the data in Splunk's search language is the way to go.
1. Use the webscrape search command plus other search commands to index parsed data
You could use the webscrape search command along with a few other search commands to retrieve the data and shape it into the format you want. Then you could use the collect command to persist it, fully parsed, to the index you want.
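A sketch of what that pipeline might look like. The summary index name (server_status_summary) and the rex pattern are my own guesses here, so adjust both to your data:

```spl
| webscrape selector="#myTable tr" url="http://server1.ourcompany.com/cgi-bin/reportStatus.cgi" depth_limit=25 text_separator="|" empty_matches=0
| mvexpand match
| rex field=match "^\|?(?<Seq>[^|]+)\|+(?<Time>[^|]+)\|+(?<Source>[^|]+)\|+(?<Event>[^|]+)"
| where Seq!="Seq"
| table Seq Time Source Event
| collect index=server_status_summary sourcetype=server_status
```

The `where` clause drops the header row (whose first cell is the literal text "Seq") so only data rows get collected.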
2. Store the data in a separate index, then use a search to parse and reindex that data into the final index
Another approach would be to create an input that retrieves the data and stores it in a separate index, then use a scheduled search to parse those results and move the parsed results into the index where you want the final data.
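For example, if the raw webscrape output were landing in a hypothetical index called web_raw, the scheduled search could look roughly like this (again, the rex pattern is a guess you'd tune to your actual match values):

```spl
index=web_raw sourcetype=server_status
| rex field=match "^\|?(?<Seq>[^|]+)\|+(?<Time>[^|]+)\|+(?<Source>[^|]+)\|+(?<Event>[^|]+)"
| where isnotnull(Seq) AND Seq!="Seq"
| table _time Seq Time Source Event
| collect index=server_status
```

The advantage is that the raw scrape stays around for reprocessing if the parsing logic ever needs to change.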
3. Use a search to parse the results directly
Finally, you could just use the UI to create the input and then run a few search commands each time you review the data to parse out the matches (as you are currently doing).
Thanks for the feedback. I admit to not following all of your suggestions. I fixed up my #1 above and added a #4, which is background to the approach below, in which I split the match field and assign fields manually. (I was obviously hoping there was some magical way of having the TH tags become, in essence, the CSV header and the TD tags the values.) I suppose I could write a custom command in Python to do this as well, but here's what I hacked together for now. I'm not sure which of your suggestions it's closest to (if any):
| webscrape selector="#ContentTbl tr" url="http://server1.ourcompany.com/cgi-bin/reportStatus.cgi" depth_limit=25 text_separator="|" empty_matches=0
| fields match
| rex field=match mode=sed "s/\|\|/|/g"
| rex field=match mode=sed "s/^\|//"
| rex field=match mode=sed "s/\|$//"
| mvexpand match
| eval flds=split(match, "|")
| fields - match
| eval Seq=mvindex(flds,0)
| eval Time=mvindex(flds,1)
| eval Source=mvindex(flds,2)
| eval Event=mvindex(flds,3)
| fields - flds
The goal is eventually to do this for multiple servers (1, 2, 3, etc.), then Splunkify it somehow so I can report against them. For that, I expect I'd have to refetch the data on a schedule and then dedup before indexing or adding to a lookup. I'm still a Splunk newbie/lame-o, so I'm feeling my way here.
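For the record, here's a rough sketch of the per-server scheduled version I have in mind. The server tag, the summary index name (server_status_summary), and the subsearch-based dedup are all assumptions on my part, not something I've tested yet:

```spl
| webscrape selector="#ContentTbl tr" url="http://server1.ourcompany.com/cgi-bin/reportStatus.cgi" depth_limit=25 text_separator="|" empty_matches=0
| eval server="server1"
| rex field=match mode=sed "s/\|\|/|/g"
| rex field=match mode=sed "s/^\|//"
| rex field=match mode=sed "s/\|$//"
| mvexpand match
| eval flds=split(match, "|")
| eval Seq=mvindex(flds,0), Time=mvindex(flds,1), Source=mvindex(flds,2), Event=mvindex(flds,3)
| fields - match flds
| search NOT [ search index=server_status_summary | fields server Seq Time ]
| collect index=server_status_summary
```

The subsearch throws out rows whose server/Seq/Time combination is already in the summary index, which should keep repeated fetches from double-indexing the same table rows.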