Routinely (24 times per day = 1 get per hour) pars...

agoktas · ‎09-13-2017

Hello,

I need to parse a specific web page's table (I'm using PowerShell with the .NET WebClient ($wc.DownloadString) to download the source code) and write the result to output.txt.

If I pull the entire source code, I get duplicate events/data for obvious reasons - which then throws off my numbers of events (based on repeats)

I need to pull the exact section of the page 24 times a day (1 x per hour), and output to file.

What I need:

The regex syntax to search html source code - specific section/table. Should I use a named variable to identify the code for beginning of the table and the end of the table - which means I can output or index all the content within?

Thanks in advance for your help!

gcusello · ‎09-14-2017

Hi agoktas,
you can fetch the page using a forwarder installed on the web server and index only the part you're interested in.

It's not possible to write a regex without seeing the page, because a regex is specific to its source.

Anyway the method I suggest is:

take your page source,
copy it into regex101.com,
write and test the regex there,
configure your filtering (props.conf and transforms.conf) using the resulting regex.

props.conf

[your_sourcetype]
TRANSFORMS-filter=set_nullqueue,filter

transforms.conf

[set_nullqueue]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue

[filter]
REGEX = your_regex
DEST_KEY = queue
FORMAT = indexQueue
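
To make the ordering in TRANSFORMS-filter easier to follow, here is a small Python sketch that only imitates the routing idea; it is not Splunk code. It assumes the transforms run left to right and that the last matching one decides the queue. The pattern used for the filter stanza is a made-up placeholder, since the real page markup was not posted. Note that this kind of filtering keeps or drops whole events; it does not cut a section out of an event.

```python
import re

# Stand-ins for the two transforms.conf stanzas, in the order listed in props.conf.
# The "filter" pattern is hypothetical; replace it with your own regex.
TRANSFORMS = [
    ("set_nullqueue", re.compile(r"."), "nullQueue"),
    ("filter", re.compile(r'(?ms)<table id="posts">'), "indexQueue"),
]


def route(event, transforms=TRANSFORMS, default="indexQueue"):
    """Return the queue an event ends up in after all transforms are applied."""
    queue = default
    for _name, regex, dest in transforms:
        if regex.search(event):
            queue = dest  # a later matching transform overrides an earlier one
    return queue


print(route('<table id="posts"><tr><td>row</td></tr></table>'))
print(route("<div>something else</div>"))
```

If the two names were swapped in TRANSFORMS-filter, the catch-all stanza would run last and every non-empty event would be discarded.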

Remember that you have a multi-line regex, so you have to put (?ms) at the beginning of it.
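
As a rough illustration of why (?ms) matters, here is a sketch in Python (not PowerShell, and not a Splunk config). The HTML, the table id "posts" and the column layout are all invented for the example, because the real page was not shared; the start and end markers would have to be adapted to the actual source.

```python
import re

# Hypothetical page source; the real markup was not posted in this thread.
html = """<html><body>
<table id="nav"><tr><td>Home</td></tr></table>
<table id="posts">
<tr><td>alice</td><td>First post</td></tr>
<tr><td>bob</td><td>Second post</td></tr>
</table>
</body></html>"""

# (?s) lets "." match across line breaks; (?m) makes ^ and $ match at each line.
# The lazy ".*?" stops at the first closing tag after the start marker.
TABLE_RE = re.compile(r'(?ms)<table id="posts">(?P<body>.*?)</table>')
ROW_RE = re.compile(r"(?s)<tr><td>(?P<user>.*?)</td><td>(?P<title>.*?)</td></tr>")


def extract_rows(page):
    """Return (user, title) pairs found inside the one table of interest."""
    match = TABLE_RE.search(page)
    if match is None:
        return []
    body = match.group("body")
    return [(m.group("user"), m.group("title")) for m in ROW_RE.finditer(body)]


rows = extract_rows(html)
for user, title in rows:
    print(user, title)
```

Without the (?s) part, the same table pattern finds nothing here, because the table body spans several lines.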

Bye.
Giuseppe

agoktas · ‎09-20-2017

Hi gcusello,

Thanks for your reply, but this is a web server that I don't own, nor do I have admin access to it. Please see my reply to Niketnilay's response in this thread; it should clarify my use case a bit more.

Let me know if we need further clarification.

Thanks!

niketn · ‎09-13-2017

@agoktas, the regex will be specific to the data, so we would need sample web page data. Please mock up the events, if not the entire page, so that the community can help you.

Please elaborate on the exact section of the page you want to pull 24 times a day (once per hour), with sample data showing which HTML tag it belongs to and what the pattern is.

____________________________________________
| makeresults | eval message= "Happy Splunking!!!"

agoktas · ‎09-14-2017

What I'm trying to do is parse a table on a particular web site (e.g. forum posts) every minute or hour (still deciding) and extract a specific named variable with a regex. Then I'm going to run reports on this named variable (e.g. number of occurrences).

I am already using PowerShell to download source code, and am indexing this output.txt.

$wc = New-Object System.Net.WebClient
$wc.DownloadString("https://website.com/forum123/") > C:\PS_Output\Output.txt

The problem I have is when I overwrite the output.txt on the routine interval, I get a lot of duplicates for this named variable. I need a way to write to this output.txt as if it were a traditional log file - thus not have duplicate events.

Hope this clarifies a bit. 🙂
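
One possible way to get the "traditional log file" behavior described above is to append only lines that have not been written before, instead of overwriting the file on every run. The sketch below is in Python rather than PowerShell, and the function name, the side file of hashes and the file names are all assumptions made for illustration. It treats two lines as duplicates only when their text is identical.

```python
import hashlib
from pathlib import Path


def append_new_lines(lines, log_path, seen_path):
    """Append lines not seen in earlier runs; return how many were written."""
    log_path, seen_path = Path(log_path), Path(seen_path)

    # Hashes of everything written so far are kept in a small side file.
    seen = set()
    if seen_path.exists():
        seen = set(seen_path.read_text(encoding="utf-8").split())

    new_lines, new_hashes = [], []
    for line in lines:
        digest = hashlib.sha256(line.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)  # also guards against repeats within one run
            new_lines.append(line)
            new_hashes.append(digest)

    if new_lines:
        # Open in append mode so earlier events stay in place.
        with log_path.open("a", encoding="utf-8") as log:
            log.write("".join(text + "\n" for text in new_lines))
        with seen_path.open("a", encoding="utf-8") as seen_file:
            seen_file.write("".join(h + "\n" for h in new_hashes))
    return len(new_lines)
```

Called once per hour with the rows taken from the table, this would leave output.txt growing only by rows that are new since the previous run, so a monitored file would not be re-read from scratch each time.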

