Splunk Search

Is there a more efficient way to remove stop words from a text field than using the makemv and mvexpand combo?

andrewtrobec
Motivator

Hello,

I'm currently performing analysis on a free text field and the first step is to remove stop words. This is my approach:

  1. makemv to convert the free text field into a list of words
  2. mvexpand to create an event for each word
  3. search with a lookup containing stop words to remove events I don't need

SPL snippet:

...
| makemv text_field
| mvexpand text_field
| search NOT [ | inputlookup stopwords.csv | rename StopWord as text_field ]
...

When I use this approach on large data sets I hit performance limits very quickly. Is there a different way to remove the stop words that is less expensive than my current approach?

Thank you and best regards,

Andrew

valiquet
Contributor

You could do this with sed (| rex mode=sed).
Can you provide the CSV?

andrewtrobec
Motivator

@valiquet
Thank you for your reply.
The CSV is a single-column lookup with the column name StopWord; it lists all of the words that I would like to remove. Here is a sample from the list (the full list is much longer):

StopWord
a
about
above
across
after
afterwards
again
against
all
almost
alone
along
already
also
although
always
am
among
amongst
amoungst
amount
an
and
another
any
anyhow
anyone
anything
anyway
anywhere
are
around
as
at
back
be
became
because
become
becomes
becoming
been
before
beforehand
behind
being
below
beside
besides
between

I'd like to point out that I am currently using the sed command to remove punctuation:

rex mode=sed field=text_field "s/[^a-zA-Z0-9_-]+/ /g"

If this can somehow be extended to cover the list of stop words in the lookup (which is a couple of hundred words long) then that would be amazing. Is this possible?

Thank you and best regards,

Andrew
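
One way to picture the "extend the sed expression" idea is to collapse the whole stop word list into a single alternation regex, so each event is cleaned in one pass instead of being expanded into one event per word. The sketch below is Python rather than SPL and is only an illustration: the function names are hypothetical, the word list is a small sample from the lookup above, and it has not been confirmed that Splunk's rex sed mode handles an alternation of a couple of hundred words efficiently (or accepts this exact expression).

```python
import re


def build_stopword_pattern(stop_words):
    """Build one alternation regex that matches any stop word as a whole word."""
    # Normalise, de-duplicate and sort so the generated pattern is deterministic.
    words = sorted({w.strip().lower() for w in stop_words if w.strip()})
    # \b on both sides keeps "a" from matching inside "restart" or "alarms".
    return r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"


def remove_stop_words(text, stop_words):
    """Remove stop words from text in a single regex pass, then tidy whitespace."""
    pattern = re.compile(build_stopword_pattern(stop_words), flags=re.IGNORECASE)
    stripped = pattern.sub(" ", text)
    return " ".join(stripped.split())


def to_sed_expression(stop_words):
    """Format the same pattern as a sed-style substitution string.

    Assumption: the target regex engine accepts (?i), \\b and (?:...).
    """
    return "s/(?i)" + build_stopword_pattern(stop_words) + "/ /g"


# Small sample taken from the StopWord column shown in this thread.
sample_stop_words = ["a", "about", "above", "after", "afterwards", "all", "and"]

cleaned = remove_stop_words(
    "All about the logs and errors after a restart", sample_stop_words
)
print(cleaned)
print(to_sed_expression(["about", "a"]))
```

Note that \b treats a hyphen as a word boundary, so a token such as "all-in" would lose its "all" even though the punctuation step above keeps hyphens; whether that matters depends on the data.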
