Hello,
Let's say I have a CSV file that contains sensitive data. At index time, I want to group multiple lines into one event so that my data is not compromised. For example:
User - Age
U1 - 12
U2 - 13
U3 - 17
U4 - 15
U5 - 20
How can I group, for example, every 2 users into one event, like this (before indexing, of course, not at search time):
U1,U2 - 12,13
U3,U4 - 17,15
...
Thanks in advance
I don't understand the reason for your business case, but here is what I would do to achieve your stated objective. Instead of running the file through standard indexing, I would read it in with a search, aggregate it, and then collect the aggregated data into a summary index.
| inputcsv mystuff.csv
| rename COMMENT as "Assign every pair of records to a group, then stats the group together "
| streamstats count as recno
| eval groupno = floor( ( 1+ recno ) / 2 )
| stats list(User) as User list(Age) as Age by groupno
| rename COMMENT as "Set time, Get rid of unneeded fields, then copy them to the new index."
| eval _time = now()
| table _time User Age
| collect .... send to desired index...
If you want to break the link of order between each User and its Age, then sort the fields after the stats command, as follows. This will break the relationship between any individual Age and its User.
| stats list(User) as User list(Age) as Age by groupno
| eval User=mvsort(User)
| eval Age=mvsort(Age)
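To see what the two mvsort calls do to a single group, here is a minimal Python sketch (not SPL). The function name sort_group and the sample group are made up for illustration. It assumes only that mvsort orders values lexicographically, which is why the values are compared as strings.

```python
def sort_group(group):
    """Sort each multivalue field independently, as `eval X=mvsort(X)` would."""
    # mvsort orders values lexicographically, so compare values as strings.
    return {field: sorted(values, key=str) for field, values in group.items()}


# Hypothetical group built from U3 (17) and U4 (15).
group = {"User": ["U3", "U4"], "Age": [17, 15]}
sorted_group = sort_group(group)

# Lexicographic order differs from numeric order once digit counts differ.
lexical_ages = sort_group({"Age": [9, 10]})["Age"]
```

After sorting, the position of an Age in its list no longer says which User it came from. Because the comparison is lexicographic, 10 sorts before 9, which is harmless here since the order is only meant to be unrelated to the User order.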
If you want to change the number of records in each group to some number K, change the eval groupno line to use K-1 and K as follows:
| eval groupno = floor( ( K-1 + recno ) / K )
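The group-number arithmetic can be checked outside Splunk. The following is a rough Python sketch (not SPL) that mirrors streamstats, the eval, and stats list(...) by groupno. The names group_number and group_rows are hypothetical, and the rows are the sample data from the question.

```python
from collections import defaultdict


def group_number(recno, k=2):
    """Mirror of floor((K-1 + recno) / K) for a 1-based record number."""
    return (k - 1 + recno) // k


def group_rows(rows, k=2):
    """Collect (user, age) rows into groups of k, keeping duplicates and order."""
    groups = defaultdict(lambda: {"User": [], "Age": []})
    for recno, (user, age) in enumerate(rows, start=1):  # streamstats count as recno
        g = groups[group_number(recno, k)]
        g["User"].append(user)  # list(User)
        g["Age"].append(age)    # list(Age)
    return dict(groups)


rows = [("U1", 12), ("U2", 13), ("U3", 17), ("U4", 15), ("U5", 20)]
pairs = group_rows(rows, k=2)
triples = group_rows(rows, k=3)
```

With K=2 this gives U1,U2 and U3,U4 as in the question. Note that the last group can come up short: with five records, U5 ends up alone in group 3, so that record is not hidden by the grouping.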
Updated to remove the suggestion to use values, since that would delete duplicates.
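There is also an issue with this anonymization method when using a small K such as K=2 or K=3: if all of the Users in a group have the same Age, the group still reveals each User's Age. Sigh. The following version reorders the records first so that records with the same Age are spread apart: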
| inputcsv mystuff.csv
| rename COMMENT as "make sure that no two of the same Age are sequential."
| streamstats count as ageno by Age
| eventstats count as totalcount
| eventstats max(ageno) as agecount by Age
| eval myorder=round((ageno-0.5)/agecount,2)
| sort 0 myorder User
| rename COMMENT as "Assign every pair of records to a group, then stats the group together "
| streamstats count as recno
| eval groupno = floor( ( 1+ recno ) / 2 )
| stats list(User) as User list(Age) as Age by groupno
| rename COMMENT as "Set time, Get rid of unneeded fields, then copy them to the new index."
| eval _time = now()
| table _time User Age
| collect .... send to desired index...
To work, the above depends on no one Age predominating in the data set.
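The reordering step can be tried on small inputs with the Python sketch below (not SPL). It copies the myorder formula from the search above. The function names and both sample data sets are invented for illustration, and Python's round may differ from Splunk's on exact halves, which does not matter for these inputs.

```python
from collections import Counter, defaultdict


def spread_order(rows):
    """Reorder (user, age) rows by myorder so equal ages are spread apart."""
    agecount = Counter(age for _, age in rows)  # eventstats max(ageno) by Age
    seen = defaultdict(int)
    keyed = []
    for user, age in rows:
        seen[age] += 1                          # streamstats count as ageno by Age
        myorder = round((seen[age] - 0.5) / agecount[age], 2)
        keyed.append((myorder, user, age))
    keyed.sort(key=lambda t: (t[0], t[1]))      # sort 0 myorder User
    return [(user, age) for _, user, age in keyed]


def pairs_with_same_age(rows):
    """Return the pairs (K=2) whose two members share an Age after reordering."""
    ordered = spread_order(rows)
    pairs = [ordered[i:i + 2] for i in range(0, len(ordered), 2)]
    return [p for p in pairs if len(p) == 2 and p[0][1] == p[1][1]]


balanced = [("U1", 12), ("U2", 12), ("U3", 13), ("U4", 13)]
dominated = [("U1", 12), ("U2", 12), ("U3", 12), ("U4", 13)]
balanced_order = spread_order(balanced)
balanced_clashes = pairs_with_same_age(balanced)
dominated_clashes = pairs_with_same_age(dominated)
```

In the balanced set the two 12s and the two 13s alternate, so no pair shares an Age. In the second set, three of the four records have Age 12, and one pair ends up with two 12s, which is the case the caveat above describes.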