Thank you, SukiSen.
Please find my answers to your questions below:
1- How big is your sample data, and what split have you used between test and sample data?
I have around 51k events, and the split is 80-20.
2-Without knowing how many distinct values your JobGroup has, is it possible to convert them to numeric like you said? It might be a bit too much if you have too many jobgroups but I would still suggest trying.
I have converted JobGroup into a numeric field; it has more than 90 distinct values.
3- About RMSE. Having a high RMSE is generally not good. What does this mean? It means that in the cases where the predicted values differ from the actual values, the difference is large. So even if, say, 95% of your predictions are reasonably accurate, the errors in the remaining 5% could be so large that your total RMSE becomes too big.
Yes, for some results the difference is too big.
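To make the point about RMSE concrete, here is a minimal sketch using made-up job durations (the numbers and the constant "model" are assumptions for illustration, not taken from my data). It shows how a handful of very large errors can dominate the RMSE even when every other prediction is exact.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error of two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical job durations in seconds: 95 typical runs and 5 runs that hung.
actual = [120.0] * 95 + [3600.0] * 5
# A hypothetical model that always predicts the typical duration.
predicted = [120.0] * 100

overall = rmse(actual, predicted)                 # includes the 5 hung jobs
typical_only = rmse(actual[:95], predicted[:95])  # only the 95 typical jobs

print("RMSE over all jobs:", overall)
print("RMSE over typical jobs only:", typical_only)
```

Because the errors are squared before averaging, the five hung jobs account for the whole RMSE here, while the other 95 predictions contribute nothing.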
4- This does not invalidate your model; remember this is an IT scenario. For example, say that at around 11 AM on 5-6 days someone executed other unplanned jobs (or jobs that normally take 2 minutes took 20 minutes or an hour, or even hung and had to be killed). That would produce a very high RMSE, but it does not mean your model is invalid. Jobs by nature will typically show these kinds of scenarios in any IT environment.
True.
5- Have you considered applying pre-processing? Just choose standard scaler and apply it to both the dependent and independent variables, it should improve your prediction.
I tried this. My R-squared value increased to 0.98, which is really good, but the RMSE is still high, up to 198. I am not sure what to do; the RMSE was good previously.
6- What does your adjusted R-squared look like?
It's between 90-98 now.
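For reference, adjusted R-squared can be computed from plain R-squared with the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of predictors. The sketch below assumes hypothetical values for n and p; the function name is mine, not from any tool.

```python
def adjusted_r2(r2, n_samples, n_features):
    """Adjusted R-squared: penalizes R-squared for the number of predictors."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need more samples than predictors plus one")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof

# Hypothetical example: R-squared of 0.98 on 10200 test events with 5 predictors.
print(adjusted_r2(0.98, n_samples=10200, n_features=5))
```

With many samples and few predictors the adjusted value stays very close to plain R-squared. Note also that R-squared is a unitless ratio while RMSE is in the units of the target, so a high R-squared and a large RMSE can occur together when the target itself varies widely.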
So what I am suggesting as next steps is:
1-Apply a split of 80-20 or 90-10 between your sample and test data and see how accurate your predictions are. If you apply a split of 80-20, the last 20% of your predicted values can be accurately compared with the actual values and you can see how well the model really behaves.
Done. Still the same result as above.
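For clarity, this is roughly what an 80-20 split that holds out the last 20% of events looks like. It is only a sketch: it assumes the events are already sorted by time, and `chronological_split` is a hypothetical helper, not a function of any particular tool.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows: first part for fitting, last part for testing."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]

# Hypothetical stand-in for time-ordered events.
events = list(range(100))
train, test = chronological_split(events, train_fraction=0.8)

print(len(train), len(test))
```

The predictions for the held-out part can then be compared directly with its actual values.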
2-Quantify your jobgroup and run random forest again, let us see how long it takes to run 🙂
I didn't get this.
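If "quantify" here means turning each JobGroup name into a number, the idea can be sketched as below. The group names are invented for illustration and `encode_labels` is a hypothetical helper; this is one simple way to do it, not necessarily what was meant.

```python
def encode_labels(values):
    """Map each distinct label to an integer code, in order of first appearance."""
    mapping = {}
    codes = []
    for value in values:
        if value not in mapping:
            mapping[value] = len(mapping)
        codes.append(mapping[value])
    return codes, mapping

# Hypothetical JobGroup values.
job_groups = ["billing", "etl", "backup", "etl", "billing"]
codes, mapping = encode_labels(job_groups)

print(codes)
print(mapping)
```

The integer codes carry no real ordering. Tree-based models such as random forest usually tolerate that, whereas for linear models a one-hot encoding is the more common choice.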
3-Lastly, and most importantly, check how often your predicted values differ substantially from the actuals. Say, for example, you have 10000 data points but only 10-20 cases with really large errors: then your model is good. It simply means that on some rare occasions something happened (bad code or a system outage) that made the jobs take longer than normal. What would be of concern is if the large errors are widespread, for example if 10-20% of your predictions are skewed.
Okay.
Can I still consider this model better than the previous one?
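The check suggested in step 3 could be done along these lines. This is a sketch on made-up numbers: the threshold of 60 seconds and the helper name are assumptions, and the right threshold depends on what counts as a large miss for these jobs.

```python
def share_of_large_errors(actual, predicted, threshold):
    """Return (count, fraction) of predictions whose absolute error exceeds threshold."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    large = sum(1 for e in errors if e > threshold)
    return large, large / len(errors)

# Hypothetical data: 10000 points, of which 15 jobs ran far longer than usual.
actual = [100.0] * 9985 + [2000.0] * 15
predicted = [105.0] * 10000

count, share = share_of_large_errors(actual, predicted, threshold=60.0)
print(count, share)
```

If the fraction is tiny, the high RMSE comes from a few rare events; if it is closer to 10-20%, the model itself is probably missing something.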