Solved: Re: How to iterate through values of a field and c...

SPulse · ‎07-10-2017

Hello,

I'm pretty new to Splunk. I've been trying to do this on my own, but can't seem to iterate through this inputlookup table. Your help would be super helpful! Thank you for your help!

I currently have an inputlookup table that looks like:

  Data   | Connected_Data     | Data_count
  D1     | D2 D10 D5          | 5 20 3
  D2     | D1 D3              | 5 10 
  D3     | D2 D5              | 10 ...
  ...

I would like to make a table of all the Connected_Data that Data is connected to:

Therefore, if D1 is connected to D2 ,D10, D5 and consequentially D2 is connected to D1, D3, Then, since we want to see all the data that's connected, even remotely, to each of the Data values: the table will look like:

Data | first_conn   | second_conn                      | third_conn            | ... 
D1    | D2 D10 D5   | X(D1)X,D3,                       | X(D2)X, X(D5)X
                    |(D10's Connected_Data),           | 
                    |(D5's Connected_Data)
D2    |D1, D3       | X(D2)X, D10, D5,                 | (D10's Connected_Data),
      |             | X(D2)X, X(D5)X                   | (D5's Connected_Data)
  ...

Here, please note, that there are X's around certain Connected_Data values, such as in D1's row: D1, D2, D5 have X's around it. This is because they are values that have already been visited. D1 is the node that we are exploring, D2 and D5 have already been visited in the first_conn, so we ignore it in the third_conn. We don't want these repeated values that have X's around them in the final values.

Since we want only the distinct values of Data that are connected (i.e. the union set), we ignore the repeated values.

I would like to finally have a table that is:

    Data  | All_Connected_Data
    D1    | D2, D10, D5, D3 ...
    D2    | D1, D3, D10, D5 ...
    ...

Please let me know if you have any questions or need any additional details. Thank you! Your help/assistance would be super helpful!

DalJeanis · ‎07-11-2017

Oh, crud.

If you just want to know IF they are connected, and don't care about the number of nodes or length, that's pretty easy.

Create records that look like this...

node1 node2
D1    D2 
D1    D5
D1    D10
D2    D1
D5    D1
D10   D1

...then run them into this...

| dedup node1 node2
| table node1 node2
| eval node1=mvappend(node1,node2)
| eval node2=node1
| rename COMMENT as "In essence, this is dumping all records that are connectable to each other into a pot and stirring."
| rename COMMENT as "The next commands guarantee that all length 2 nodes are connected. Most length 3, but that's not guaranteed."
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2

| rename COMMENT as "Every time you repeat the next two commands SHOULD double the length of paths that are connected."
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2
| rename COMMENT as "The above should have connected up all lengths up to 32."

... and finally into this

| mvexpand node1 | stats values(node2) as node2 by node1

... which will produce one record per node1, with the values of all node2s that can be connected to that node by half that length path, regardless of cost.

See below for comments if you need to deal with path length or path cost...

The key to your question, according to my prior thinking, was not the word "iterate", but the word "hierarchical". You have to think in terms of looking for ways to connect up the nodes.

However, I see on further inspection that your data is redundant (containing duplicate node lengths in both directions) and also potentially cyclical, clearly not hierarchical.

And your data format is not helping you solve it.

I have a requirements question here - if there exists an edge from D3 to D10 with length 6, for example, what do you want to do? The path D1-D5-D10 would be 9 units long, 11 units shorter than the path D1-D10. Are you trying to find the shortest path, or all paths, or what?

That requirement will determine what method you need to use to build the table.

I can tell you already that any decent solution will probably require you to start by turning your data format into something like...

node1 node2 length cost
D1       D2         1         5
D1       D10       1         20

...and over your iterations (or other method) you will keep any record that is not dominated by another record. That is, you eliminate a record2 only if there is another record1 where length1<=length2 AND cost1<=cost2. In the case of duplicates, it doesn't matter which one is eliminated.

That means that after calculating the first length-2 set, you'd have records somewhat like this

node1 node2 length cost
D1       D2         1         5
D1       D5         1         3
D1       D10       1         20
D1       D10       2         9
D5       D10       1         6

View solution in original post

DalJeanis · ‎07-11-2017

Oh, crud.

If you just want to know IF they are connected, and don't care about the number of nodes or length, that's pretty easy.

Create records that look like this...

node1 node2
D1    D2 
D1    D5
D1    D10
D2    D1
D5    D1
D10   D1

...then run them into this...

| dedup node1 node2
| table node1 node2
| eval node1=mvappend(node1,node2)
| eval node2=node1
| rename COMMENT as "In essence, this is dumping all records that are connectable to each other into a pot and stirring."
| rename COMMENT as "The next commands guarantee that all length 2 nodes are connected. Most length 3, but that's not guaranteed."
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2

| rename COMMENT as "Every time you repeat the next two commands SHOULD double the length of paths that are connected."
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2
| mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2
| rename COMMENT as "The above should have connected up all lengths up to 32."

... and finally into this

| mvexpand node1 | stats values(node2) as node2 by node1

... which will produce one record per node1, with the values of all node2s that can be connected to that node by half that length path, regardless of cost.

See below for comments if you need to deal with path length or path cost...

The key to your question, according to my prior thinking, was not the word "iterate", but the word "hierarchical". You have to think in terms of looking for ways to connect up the nodes.

However, I see on further inspection that your data is redundant (containing duplicate node lengths in both directions) and also potentially cyclical, clearly not hierarchical.

And your data format is not helping you solve it.

I have a requirements question here - if there exists an edge from D3 to D10 with length 6, for example, what do you want to do? The path D1-D5-D10 would be 9 units long, 11 units shorter than the path D1-D10. Are you trying to find the shortest path, or all paths, or what?

That requirement will determine what method you need to use to build the table.

I can tell you already that any decent solution will probably require you to start by turning your data format into something like...

node1 node2 length cost
D1       D2         1         5
D1       D10       1         20

...and over your iterations (or other method) you will keep any record that is not dominated by another record. That is, you eliminate a record2 only if there is another record1 where length1<=length2 AND cost1<=cost2. In the case of duplicates, it doesn't matter which one is eliminated.

That means that after calculating the first length-2 set, you'd have records somewhat like this

node1 node2 length cost
D1       D2         1         5
D1       D5         1         3
D1       D10       1         20
D1       D10       2         9
D5       D10       1         6

SPulse · ‎07-12-2017

@DalJeanis: could you please explain how when we repeat the following piece of code, the length doubles: | mvexpand node1 | stats values(node2) as node2 by node1 | eval node1=mvdedup(mvappend(node1,node2)) | eventstats values(node1) as node1 by node2

Also, could we possibly create new variables (eval fields) every time we repeat this piece of code to store the increase in length as the next layer?

Thank you!

DalJeanis · ‎07-13-2017

Sure. The below example will show how the node neighborhood goes in one step from 4+1 to 8+1.

Let's suppose that at time N, you have two records (you'll have a lot more, but look at these two) -

  node1                                     node2                                
  X10 X11 X12 X13 X14           X10 X11 X12 X13 X14
  X14 X15 X16 X17 X18           X14 X15 X16 X17 X18

The mvexpand will break that into 10 records, two of which have the value of node1 as X14.

  node1                                     node2                                
  X14                                          X10 X11 X12 X13 X14
  X14                                          X14 X15 X16 X17 X18

The stats will combine those two records into one record like this...

  X14                                        X10 X11 X12 X13 X14 X15 X16 X17 X18

The eval will do this...

 X10 X11 X12 X13 X14 X15 X16 X17 X18
                                               X10 X11 X12 X13 X14 X15 X16 X17 X18

... and the last one does some cleanup. That's how it goes from 5 to 9, or 9 to 17, or 17 to 33 in one step.

SPulse · ‎07-14-2017

Thank you @DalJeanis

SPulse · ‎07-11-2017

Thank you @DalJeanis! You seem to be very close to arriving at the solution.

But I would like to have the added information of each of the layers of nodes from the node1 (or Data) to the farthest node that it is even remotely connected. So, I want to know that if D1 is connected to D2, D5, D10 in the first layer, then the second layer would be D2's connections, D5's connections, D10's connections and so on, till we reach the last layer. I'm not looking to find the shortest path.

About the path cost : Now, I'm going to translate the rest into the vocabulary and idea you're using i.e. that related to graphs. I already have the cost of each of the paths between the nodes. When we have the layers of nodes from the each of the node1's to the last node it is even remotely connected to, I would like each of the costs to be like:

Please let me know if I can provide more details. Your help is appreciated a lot!

DalJeanis · ‎07-10-2017

Oh, gosh, I believe I worked on something like this back in March... and I believe you're doing way too much work. Let me sleep on it and maybe it will pop up.

Wait, here, check this out and see if it meets your needs...
https://answers.splunk.com/answers/357876/how-do-you-order-stats-by-multiple-hierarchical-fi.html

SPulse · ‎07-11-2017

Thanks @DalJeanis! I looked at the question you referred to and it talks about a way to count by multiple fields and order it by different fields. Unfortunately, I don't seem to be able to connect it with my question.

I would like to be able to access and combine the values from the Data field, for each value of the Connected_Data. Moreover, combine it such that we don't have any repeated values (I believe we can just perform the values() command on the combined field for that.

Thus, go to D1 data point and see that it has D2, D10, D5 as it's Connected_Data. So I would like to make another field, which searches Data field for D2 and pulls in all it's Connected_Data, then does the same for finding both D10 and D5 in the Data, and pull in their Connected_Data respectively. Then, perform some sort of a values() function so that the values in this field are not repeated. Similarly, then it would repeat the same process for the third_connection too. There's over a 100 Data values.

Please let me if you would like any additional details. Thank you SO much for your help.

How to iterate through values of a field and combine values of other fields respectively?

More Ways To Control Your Costs With Archived Metrics | Register for Tech Talk

.conf24 | Personalize your .conf experience with Learning Paths!

Threat Hunting Unlocked: How to Uplevel Your Threat Hunting With the PEAK Framework ...