Hello,
I need to parse a table on a specific web page and write the result to output.txt. I'm downloading the page source with PowerShell (a .NET WebClient, $wc.DownloadString).
If I pull the entire page source, I get duplicate events/data for obvious reasons, which then throws off my event counts (they are based on repeats).
I need to pull only that section of the page 24 times a day (once per hour) and write it to a file.
What I need:
The regex syntax to search the HTML source code for a specific section/table. Should I use a named variable to identify the code at the beginning and at the end of the table, so that I can output or index all the content in between?
Thanks in advance for your help!
Hi agoktas,
You can collect the page using a forwarder installed on the web server and index only the part you're interested in.
It's not possible to write the regex without seeing the page, because a regex is specific to its source data.
Anyway, the method I suggest is:
props.conf
[your_sourcetype]
TRANSFORMS-filter = set_nullqueue,filter
transforms.conf
[set_nullqueue]
REGEX = .
DEST_KEY = queue
FORMAT = nullQueue
[filter]
REGEX = your_regex
DEST_KEY = queue
FORMAT = indexQueue
Remember that your regex has to match across multiple lines, so you have to put (?ms) at the beginning of it.
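To illustrate the multi-line point, here is a minimal sketch in Python (used here only because it is easy to run, not because the thread uses it). The page source, the table id "posts" and the group name table_body are all made up, since the real page markup isn't known. It shows (?s) letting "." run across line breaks, with a named group capturing everything between the start and end of the table. Python spells named groups (?P<name>...), while the PCRE flavor Splunk uses also accepts (?<name>...).

```python
import re

# Hypothetical page source; the real markup of the forum page is not known.
html = """<html><body>
<div id="nav">Recent: Post A</div>
<table id="posts">
<tr><td>Post A</td></tr>
<tr><td>Post B</td></tr>
</table>
<div id="footer">Popular: Post A</div>
</body></html>"""

# (?s) lets "." match newlines, so the pattern can span several lines.
# (?P<table_body>...) is a named group holding everything between the markers.
TABLE_RE = re.compile(r'(?s)<table id="posts">(?P<table_body>.*?)</table>')


def extract_table(source):
    """Return the text between the table markers, or None if they are absent."""
    match = TABLE_RE.search(source)
    return match.group("table_body") if match else None


table_body = extract_table(html)
# Only cells inside the table are returned; the nav and footer repeats are ignored.
rows = re.findall(r"<td>(.*?)</td>", table_body)
print(rows)  # ['Post A', 'Post B']
```

The lazy .*? stops at the first closing tag, so a page with several tables would need a more specific start marker.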
Bye.
Giuseppe
Hi cusello,
Thanks for your reply, but this is a web server that I don't own, and I don't have admin access to it. Please see my reply to Niketnilay above. It should clarify my use case a bit more.
Let me know if we need further clarification.
Thanks!
@agoktas, the regex will be specific to the data, so we would need a sample of the web page. Please mock up the events, if not the entire page, so that the community can help you.
Please elaborate on "exact section of the page 24 times a day (1 x per hour)" with sample data showing which HTML tag the section belongs to and what the pattern is.
What I'm trying to do is parse a table on a particular web site (e.g., forum posts) every minute or every hour (still deciding) and regex a specific named variable. Then I'm going to run reports on this named variable (e.g., number of occurrences).
I am already using PowerShell to download the source code, and I am indexing the resulting output.txt.
$wc.downloadstring("https://website.com/forum123/") >C:\PS_Output\Output.txt
The problem I have is that when I overwrite output.txt at each interval, I get a lot of duplicates for this named variable. I need a way to write to output.txt as if it were a traditional log file, so that it does not contain duplicate events.
Hope this clarifies a bit. 🙂
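One way to get the "traditional log file" behaviour described above is to append only rows that earlier runs have not already written, instead of overwriting the file. The sketch below is in Python rather than PowerShell, and it is only an illustration under assumptions: the function name append_new_rows, the side file seen.txt and the sample rows are hypothetical, and it assumes each table row has already been extracted as one line of text.

```python
import hashlib
import tempfile
from pathlib import Path


def append_new_rows(rows, log_path, seen_path):
    """Append rows not written by earlier runs to log_path; return the rows written."""
    log_file, seen_file = Path(log_path), Path(seen_path)
    # The side file holds one SHA-256 digest per row that was already logged.
    seen = set(seen_file.read_text(encoding="utf-8").split()) if seen_file.exists() else set()
    written = []
    with log_file.open("a", encoding="utf-8") as log, \
            seen_file.open("a", encoding="utf-8") as state:
        for row in rows:
            digest = hashlib.sha256(row.encode("utf-8")).hexdigest()
            if digest in seen:
                continue  # already logged on a previous run (or earlier in this one)
            seen.add(digest)
            log.write(row + "\n")
            state.write(digest + "\n")
            written.append(row)
    return written


# Two simulated hourly runs against a temporary directory (hypothetical rows).
with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "Output.txt"
    seen = Path(tmp) / "seen.txt"
    first = append_new_rows(["Post A", "Post B"], log, seen)
    second = append_new_rows(["Post B", "Post C"], log, seen)
    log_lines = log.read_text(encoding="utf-8").splitlines()

print(first)      # ['Post A', 'Post B']
print(second)     # ['Post C']
print(log_lines)  # ['Post A', 'Post B', 'Post C']
```

Because the file only ever grows, a file monitor would pick up just the appended lines. Note that this treats two rows with identical text as the same event, which may or may not be what the report needs.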