In order to identify web content that hasn't been pulled in a while, I thought I would use Splunk since a) my Apache logs are in Splunk already, and b) I can easily create a scripted input to get a list of files under the various directories. Initially, I'm going to do this for our .cgi's and .pl files
So, I have one index for the standard Apache access logs. I do have a field extraction for this called file. More on that later.
I then created a scripted input that runs once per day to pull a list of files under our content sub-directory (we're talking 13,000+ files). An example of the input looks like this:
09/29/10 15:42:46 -0400,file=actDefaultAccSet.cfm,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=liferayLogin.html,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=favicon.ico,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=favicon.gif,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=Cps_Doc_Upload_Rules.doc,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=ordocs-index.jsp,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=contact_me2.cfm,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=orprefs-index.html,app_root=public,dir=/cfmx_files/cfmx61/public
09/29/10 15:42:46 -0400,file=ppsathanks.html,app_root=public,dir=/cfmx_files/cfmx61/public
I can do a query that looks like this:
index="prod_ohs_logs" [search index="prod_coldfusion_files" file="*\.cgi" OR file="*\.pl" | fields file ] | table file | dedup file
Which only returns 36 out of the 125 .pl / .cgi files out there, which is not exactly what I'm looking for.
Basically, I'm looking to take a list of files from a specific query, check to see how many of those files are found in the Apache logs, including ones with zero results.
I've spent a couple of days trying to get this working, and I haven't been able to. Any ideas on how to do this? Is it even possible?
Your best strategy here is to use an OR search, to load data from both prod_ohs_logs and prod_coldfusion_files at the same time and see, for each file, whether it is in one, the other or both of the indexes. For example:
index="prod_ohs_logs" OR (index="prod_coldfusion_files" file="*\.cgi" OR file="*\.pl") | chart count by file index
Your best strategy here is to use an OR search, to load data from both prod_ohs_logs and prod_coldfusion_files at the same time and see, for each file, whether it is in one, the other or both of the indexes. For example:
index="prod_ohs_logs" OR (index="prod_coldfusion_files" file="*\.cgi" OR file="*\.pl") | chart count by file index
Pure awesomeness Stephen. Thank you!
Just add "... | search prod_condfusion_files=0" to your search.
Great, it's a starting point. I need to figure out how to only list the files that have 1 as the results under prod_coldfusion_files..