Top search results from Drupal

staze
Path Finder

Okay, I've done this once in Plone, but we've moved to Drupal, and things don't look the same.

Basically, I want to grab the top search terms from a given timeframe. Drupal search URLs look like:

http://site.example.com/search/site/<term>, where <term> is something like "splunk" or "foobar" or whatever.

A log entry looks something like this (in this case I searched for "splunk"; the server is Apache):

111.222.333.444 - - [06/Feb/2012:14:38:07 -0800] "GET /search/site/splunk HTTP/1.1" 200 9289 "http://site.example.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.77 Safari/535.7"

Previously, in Plone, I was using something like:

host="hostname" file="search" SearchableText="*" | eval SearchableText=lower(SearchableText) | top limit=10 SearchableText

But there's no query variable being set like that.

Thoughts? Help?

araitz
Splunk Employee

What is the sourcetype for your Drupal data? It looks like a standard access log. What if you run the following search?

 host="hostname" file="search" | kv access-extractions | eval SearchableText=lower(uri) | top limit=10 SearchableText

UPDATE: the final answer from comments below:

The best thing to do would be to make the 'rex' field extraction a permanent one using props.conf (http://docs.splunk.com/Documentation/Splunk/latest/Admin/Propsconf):

[source::.../access_log*]
EXTRACT-access = "/(?<last_part>[^/]+)$" in uri_path

Then you can do:

source="/var/log/apache2/access_log" uri_path="/search/site/*" NOT last_part=*comment*  NOT last_part="favicon.ico" | top limit=10 last_part
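
For anyone who wants to see what the extraction and filters do outside Splunk, here is a rough Python sketch of the same logic. It is only an illustration, not how Splunk implements it: the function name `top_search_terms` and the sample paths are made up, and the lowercasing step is borrowed from the in-line variant discussed in the comments rather than from the search above.

```python
import re
from collections import Counter

# Same pattern as the EXTRACT rule, written with Python's named-group syntax.
LAST_PART = re.compile(r"/(?P<last_part>[^/]+)$")


def top_search_terms(uri_paths, limit=10):
    """Count the final path segment of each /search/site/ request (hypothetical helper)."""
    counts = Counter()
    for path in uri_paths:
        # Mirrors uri_path="/search/site/*"
        if not path.startswith("/search/site/"):
            continue
        match = LAST_PART.search(path)
        if match is None:
            continue  # nothing after the final slash
        # Lowercase so "Splunk" and "splunk" are counted together.
        term = match.group("last_part").lower()
        # Mirrors NOT last_part=*comment* NOT last_part="favicon.ico"
        if "comment" in term or term == "favicon.ico":
            continue
        counts[term] += 1
    return counts.most_common(limit)


# Made-up sample data, for illustration only.
sample_paths = [
    "/search/site/splunk",
    "/search/site/Splunk",
    "/search/site/foobar",
    "/search/site/favicon.ico",
    "/search/site/comment-form",
    "/node/12",
]

print(top_search_terms(sample_paths))
```

With the sample data, "splunk" is counted twice and "foobar" once; the favicon, comment, and non-search paths are dropped.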

araitz
Splunk Employee

Take a look at the Web Intelligence app. These use cases and a lot more are built in, and the app is free and supported: http://splunk-base.splunk.com/apps/28994/splunk-app-for-web-intelligence

staze
Path Finder

Cool, that worked! Thanks! I've added the stuff to props.conf, but I have to wait for the webintelligence backfill to finish before restarting Splunk.

Thanks again!

araitz
Splunk Employee

If you do the extraction in-line with 'rex' instead of in props.conf, the search will be far less efficient. As a rule of thumb, as much filtering as possible should be done to the left of the first pipe:

source="/var/log/apache2/access_log" uri_path="/search/site/*" | rex field=uri_path "/(?<last_part>[^/]+)$" | eval last_part=lower(last_part) | search NOT last_part=*comment* | eval last_part = mvfilter(last_part != "favicon.ico" ) | top limit=10 last_part

staze
Path Finder

Okay, last one seems to work (with the rex field). I'm very close, the only issue is, I want to ignore any results that contain the word "comment".

Here's what I have:
source="/var/log/apache2/access_log" uri_path="/search/site/*" | rex field=uri_path "/(?<last_part>[^/]+)$" | eval last_part=lower(last_part) | eval last_part = mvfilter(last_part != "favicon.ico" ) | top limit=10 last_part

The mvfilter is obviously removing "favicon" from the results. And I needed to run the results through "lower" to remove the case duplicates.

Almost....There....

araitz
Splunk Employee

So if uri_path is already an extracted field, you don't need the '| kv access-extractions'. You can try this to get query strings:

source="/var/log/apache2/access_log" uri_path="/search/site/*" uri_query=* | top limit=10 uri_query

To get the last part before the query string:

sourcetype="access_combined_wcookie" | rex field=uri_path "\/(?<last_part>[^\/]+)$" | top limit=10 last_part
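
To make the difference between the path and the query string concrete, here is a small Python sketch using the standard library. It is only roughly analogous to Splunk's uri_path and uri_query fields, whose exact contents may differ; the helper name `split_request` and the `?page=2` example are hypothetical.

```python
from urllib.parse import urlsplit


def split_request(request_target):
    """Return (path, query, last path segment) for a request target (hypothetical helper)."""
    parts = urlsplit(request_target)
    # Last non-empty segment of the path, i.e. the part after the final slash.
    last_part = parts.path.rstrip("/").rsplit("/", 1)[-1]
    return parts.path, parts.query, last_part


# A Drupal-style search URL carries the term in the path, so the query is empty.
print(split_request("/search/site/splunk"))

# A made-up URL with a query string, for comparison.
print(split_request("/search/site/splunk?page=2"))
```

Since the Drupal search term lives in the path rather than in the query string, the second approach (taking the last part of uri_path) is the one that applies to the log entry in the question.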

staze
Path Finder

I think I'm close. The above didn't work quite right, but this seems close...

source="/var/log/apache2/access_log" uri_path="/search/site/*" | kv access-extractions | eval SearchableText=lower(uri) | top limit=10 SearchableText

Problem is, I'm getting results like:

"/search/site/scholarship". Is there a way to just remove the "/search/site/" part of that result, so I just get the actual search term?

Also, how does one remove certain results? Like, getting a favicon.ico in the results because it happens to get loaded from a location with "/search/site" in the url for some reason...

Thoughts?

And thanks. I've got backfilling going with the webintelligence app... will have to see how that works.
