Too many distinct values for categorical feature i...

jthairu_splunk · ‎09-30-2017

I am attempting to use the Machine Learning Toolkit to predict what cities people are from based on information such as zip code, region code, country code, etc. The issue I am having is that Splunk only accepts 100 distinct values for a categorical field, and I have over 1000. As a result I am getting the following error: (Error in 'fit' command: Error while fitting "LogisticRegression" mode)
Anyone know of a solution? Thanks!

aljohnson_splun · ‎10-23-2017

hi @Jthairu

One workaround for the limit is to do the "one-hot" encoding yourself with eval and fillnull. Say your field foo has over 100 distinct values:

| eval {foo} = 1
| fillnull

Often, if you're running into this limit, it is worth questioning whether each of those categorical values really brings meaning to your model.
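
For readers who want to see what the eval/fillnull trick above produces, here is a minimal Python sketch of the same idea outside of Splunk. The function name `one_hot` and the sample rows are made up for illustration, and it assumes the category values do not collide with existing field names. Each distinct value of `foo` becomes its own column holding 1 where the row has that value (the `eval {foo} = 1` step) and 0 everywhere else (the `fillnull` step, whose default fill value is 0).

```python
def one_hot(rows, field):
    """Expand one categorical field into 0/1 indicator columns.

    Illustrative helper, not part of Splunk or the Machine Learning Toolkit.
    """
    # Every distinct, non-missing value becomes a new column name.
    values = sorted({row[field] for row in rows if row.get(field) is not None})
    encoded = []
    for row in rows:
        out = dict(row)  # keep the original fields, as eval does
        for value in values:
            # 1 for the row's own value, 0 for all other values
            out[value] = 1 if row.get(field) == value else 0
        encoded.append(out)
    return encoded


# Hypothetical sample data
rows = [{"foo": "OAK"}, {"foo": "BOS"}, {"foo": "OAK"}]
encoded = one_hot(rows, "foo")
for row in encoded:
    print(row)
```

With over 1000 distinct values this yields over 1000 mostly-zero columns, which is one reason to ask whether every value is really needed.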

Sukisen1981 · ‎10-01-2017

Well, I don't understand your model at all.
1- If you have a zip code, why do you need to predict the city? Isn't the city name a DIRECT extrapolation of the zip code? So what exactly are you trying to 'predict' here?
2- This is a logistic regression; you cannot have so many distinct values of MBR_CITY_CODE_DELETED. You need to understand the logistic regression model in a bit more depth: In statistics, logistic regression, or logit regression, or logit model, is a regression model where the dependent variable (DV) is categorical. This article covers the case of a binary dependent variable, that is, where the output can take only two values, "0" and "1"...
I would say that if you are trying to predict the city name using a logistic regression model, you are probably using the wrong model.

jthairu_splunk · ‎10-01-2017

@Sukisen1981 Yes you are correct, the logistic regression model is not the correct model to predict city name. To answer your question, I have a data set that contains information about people such as region code (not zip code), country code, home airport, favorite airport etc, and based on this information, I want to be able to predict what city they are from. Does Splunk's Machine Learning Toolkit have the capability to do this with the algorithms natively contained in it?

Sukisen1981 · ‎10-01-2017

hi,
From what I understand, you are not really trying to 'predict' something, in the sense that every person ALREADY belongs to some city. There is nothing to predict here.
However, I can also sense what you are trying to do.
You already have a historical data set consisting of region code (not zip code), country code, home airport, favorite airport, etc., AND the city of each individual. Based on this you are trying to predict the city of a new individual. If I am right so far, you have 2 options:
1- Convert all your qualitative data into numeric data, for example for favorite airport: if OAK then 1, if BOS then 2, and so on, AND then use the Cluster Numeric Events algorithm to predict which cluster a new individual would belong to.
2- Use the jellyfisher app and compute the Jaro / Jaro-Winkler distance between all the text fields AND the new individual's data. The sum of the Jaro distances over all fields, sorted in descending order, will give you the best match of the new individual to the historical data set. You then just need to pull the city name from the historical data set for each row.

I would recommend the 2nd approach. The Jaro distance works very well and takes care of spelling mistakes (so that BOSTON is almost equal to BOTSON, as compared to, say, BALDWIN). Once you get the top 2-3 Jaro distance matches, you can pull the city field from the historical data and apply a dedup in case of duplicate matches. Try it. I did a similar piece of work predicting the location of callers based on their time zone, address and work address, and it works with around 94% accuracy.
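
As a rough illustration of the 2nd approach, here is a minimal Python sketch under stated assumptions. It implements the plain Jaro similarity by hand (not Jaro-Winkler, and not the jellyfisher app's own code), sums the scores over the shared text fields, sorts in descending order, and dedups the matching cities. The field names, the helper `best_city_matches`, and the sample records are all hypothetical.

```python
def jaro(s1: str, s2: str) -> float:
    """Plain Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters only count as matching within this distance of each other.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that line up in a different order are transpositions.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def best_city_matches(new_person, history, fields, top_n=3):
    """Rank historical rows by summed Jaro score; return deduped cities."""
    scored = [(sum(jaro(new_person[f], row[f]) for f in fields), row["city"])
              for row in history]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    cities = []
    for _, city in scored:
        if city not in cities:  # dedup while keeping best-first order
            cities.append(city)
    return cities[:top_n]


# Hypothetical historical data and a new individual without a city
history = [
    {"region_code": "NE", "country_code": "US", "home_airport": "BOS", "city": "BOSTON"},
    {"region_code": "NE", "country_code": "US", "home_airport": "BOS", "city": "BOSTON"},
    {"region_code": "WC", "country_code": "US", "home_airport": "OAK", "city": "OAKLAND"},
    {"region_code": "WC", "country_code": "US", "home_airport": "SFO", "city": "SAN FRANCISCO"},
]
new_person = {"region_code": "WC", "country_code": "US", "home_airport": "OAK"}
fields = ["region_code", "country_code", "home_airport"]

print(best_city_matches(new_person, history, fields))
print(jaro("BOSTON", "BOTSON") > jaro("BOSTON", "BALDWIN"))
```

Summing the per-field scores treats every field as equally important; weighting the fields differently would be a separate design choice.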

jthairu_splunk · ‎10-01-2017

@Sukisen1981 You are correct, the logistic regression model was not the right model to predict the city name. To answer your question, I have a data set that has information such as region code, country code, home airport, favorite airport (not zip code), etc., and I want to create a model that, based on this type of information, will predict where a person is from. Does Splunk's Machine Learning Toolkit have the capability to do so with the algorithms natively contained?

Too many distinct values for categorical feature in the Machine Learning Toolkit
