Data to Train Machine Learning

kevinmabini
Engager

Hey guys,

Hope you can give some insight on this.

Currently, we are using Machine Learning (ML) for predicting new ticket type/category. We are using the whole index as input to train our model.

My question: is it right to use the whole index as input? Or does ML only need the new set of data/events to train the model?

DalJeanis
SplunkTrust

As a general case, NO. You should reserve at least half the data for testing the results of the training. Otherwise, how will you validate that the result is reasonable?

Second, "predicting new ticket type/category" is pretty vague. What is the research question? What are you looking to achieve by having this new category? What kind of tickets are we talking about - airplane tickets, trouble tickets, concert tickets, sports tickets?
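
To illustrate the point about reserving data for testing, here is a minimal Python sketch (not MLTK, and not the poster's method). The ticket descriptions, the 50/50 split, and the deliberately naive "memorizing" model are all made up for the example. It shows how a model can look perfect on the data it was trained on and still do badly on data it has never seen, which only a held-out set reveals.

```python
def train_memorizer(rows):
    """Naive 'model': remember the label seen for each exact description."""
    return {text: label for text, label in rows}

def accuracy(model, rows, default="Admin Request"):
    """Fraction of rows whose predicted label matches the true label.
    Descriptions the model has never seen fall back to `default`."""
    hits = sum(1 for text, label in rows if model.get(text, default) == label)
    return hits / len(rows)

# Hypothetical incidents as (description, category), interleaved by category
tickets = []
for i in range(50):
    tickets.append((f"reset password for user {i}", "Admin Request"))
    tickets.append((f"deploy patch {i} to prod", "Change Request"))

# Reserve half of the data for testing
half = len(tickets) // 2
train, test = tickets[:half], tickets[half:]

model = train_memorizer(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)

print(f"accuracy on training data: {train_acc:.2f}")  # prints 1.00
print(f"accuracy on held-out data: {test_acc:.2f}")   # prints 0.50
```

Scoring the model on the whole index it was trained on would only ever report the first number.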

kevinmabini
Engager

Thanks for the response, @DalJeanis!

By tickets, I mean incidents logged by users. We are using ML to auto-categorize each logged incident as an 'Admin Request', 'Change Request', etc.

As for the data: if a new set of data is ingested into Splunk and has also been auto-categorized, is it advisable to use it as training data for ML?

iceco
New Member

Do you know the SPL command to split the data into training and test sets? I didn't see it in the docs. Thanks.

worshamn
Contributor

It is not the clearest thing in the docs, but you use the sample command that comes with MLTK with its partitions option (setting it to 10 is usually what you want), and then search on partition_number < X. Partition numbers start counting at 0, so for a 70/30 split X is 7: partition_number < 7 selects partitions 0 through 6. Make sure to use the seed option so that you can come back with the same seed and search partition_number > X-1 to get the other side of the split.

Training set:
| sample partitions=10 seed=1234 | search partition_number < 7 | fit MLAlgoName target_field from whatever_fields into saved_model_name

Test set:
| sample partitions=10 seed=1234 | search partition_number > 6 | apply saved_model_name as predicted_field
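
For anyone who wants to see why the seed matters, here is a rough Python sketch of the same idea. It is an analogy only: it does not reproduce how MLTK's sample command assigns partitions internally, and the function names, event names, and cutoff parameter are hypothetical. Each event gets a pseudo-random partition number from 0 to 9, and reusing the seed reproduces the same assignment, so the two filters give complementary, non-overlapping sets.

```python
import random

def assign_partitions(events, partitions=10, seed=1234):
    """Tag each event with a pseudo-random partition number in [0, partitions)."""
    rng = random.Random(seed)  # fixed seed: same assignment on every run
    return [(rng.randrange(partitions), event) for event in events]

def split(events, cutoff=7, partitions=10, seed=1234):
    """Return (train, test): partitions below `cutoff` train, the rest test."""
    tagged = assign_partitions(events, partitions, seed)
    train = [event for number, event in tagged if number < cutoff]
    test = [event for number, event in tagged if number >= cutoff]
    return train, test

# Hypothetical events standing in for the indexed incidents
events = [f"incident-{i}" for i in range(1000)]

train, test = split(events)
train_again, test_again = split(events)  # same seed, same split

print(len(train), len(test))
print(train == train_again and test == test_again)  # True
```

Because the assignment is random, the split in this sketch comes out close to 70/30 but not exactly. With a different seed on the second call, some events could land in both sets or in neither.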

nasrinmulani
New Member

You can split the data 70/30 between training and testing: 70% of the data is used to train the model and the remaining 30% to test it.
