Hi
I would like to know whether I am overfitting, because my results look too good.
The model was trained on the MAY dataset and has never seen the JUNE dataset, yet its predictions on JUNE are very good.
I also tested it with a "dummy" dataset, the one that ships with MLTK by default. On that dataset the results are bad.
I suspect that my SPL may be wrong, but I am not sure.
Thank you
IMAGE 1
TRAIN
| inputlookup fortigate_QC_May2019_logins.csv //loading the MAY dataset, company A
| fit StandardScaler "logins" with_mean=false with_std=true //normalizing the data
| fit DBSCAN "SS_logins" //finding outliers
| where NOT isOutlier==-1 //removing the outliers
| fit LinearRegression SS_logins from * into "authentication_profiling_LinearRegression" //fitting the model and saving it
TEST
| inputlookup fortigate_QC_June2019_logins.csv //loading the JUNE dataset, company A
| fit StandardScaler "logins" with_mean=false with_std=true //normalizing the data
| apply "authentication_profiling_LinearRegression" //applying the saved model
| table _time, "SS_logins", "predicted(SS_logins)" //showing actual vs. predicted values
TESTING WITH DUMMY DATASET
| inputlookup logins.csv //this is the dummy dataset: the logins are from company B
| fit StandardScaler "logins" with_mean=false with_std=true //normalizing
| apply "authentication_profiling_LinearRegression" //applying the model from company A
| table _time, "SS_logins", "predicted(SS_logins)"
THIS IS THE PLOT WITH THE DUMMY DATASET
Results are not good.
IMAGE 2
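To test the "maybe my SPL is wrong" idea outside Splunk, here is a minimal Python sketch of what the three searches might be computing. It rests on two assumptions I have not confirmed: that `from *` hands the raw `logins` field to LinearRegression as a feature, and that running `fit StandardScaler` again on each test file rescales by that file's own standard deviation. The login counts are made up, and the helper names (`scale`, `fit_slope`, `max_abs_error`) are only for illustration; this is not MLTK code.

```python
from statistics import pstdev


def scale(values):
    """Divide by the population std (no centering), as with_mean=false would."""
    sd = pstdev(values)
    return [v / sd for v in values]


def fit_slope(x, y):
    """Least-squares slope through the origin: one feature, no intercept.
    Simplification: the target here is exactly proportional to x, so an
    intercept term would come out as zero anyway."""
    return sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)


def max_abs_error(x, y, slope):
    """Largest gap between the prediction slope * x and the target y."""
    return max(abs(slope * a - b) for a, b in zip(x, y))


# Made-up login counts, NOT taken from the real lookups.
may = [10, 12, 9, 15, 11, 14]            # training month, company A
june = [11, 13, 9, 14, 10, 15]           # same company, similar spread
dummy = [100, 400, 50, 900, 250, 700]    # other company, very different spread

# "Training": the target is just logins / std(May), so the model only has
# to learn the constant 1 / std(May).
slope = fit_slope(may, scale(may))

# "Testing": each dataset is rescaled by its OWN std before comparison.
errors = {
    name: max_abs_error(data, scale(data), slope)
    for name, data in [("may", may), ("june", june), ("dummy", dummy)]
}

for name, err in errors.items():
    print(f"{name}: max abs error = {err:.4f}")
```

If those assumptions hold, the model would only be learning a rescaling constant. That would explain a near-perfect fit on JUNE (similar spread to MAY) and a poor fit on the dummy data (very different spread), without overfitting in the usual sense.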