
CSV files with variable fields

l0r3zz
New Member

I want to ingest CSV files containing numeric data. Each file will have between 500 and 150,000 fields (yes, that's right, 150K :). The first line of the CSV will have the column names (headers). Each CSV file can have different field names and a different number of fields. If anyone is acquainted with esxtop batch-mode output (from VMware hypervisors), you know what I'm talking about. I'm relatively new to Splunk, but what I want to eventually accomplish is a dashboard that can manipulate the data found in these fields. If you've seen the ESXplot Python tool (which I wrote), you will get the idea. Any help on how I might begin to look at this would be appreciated.


l0r3zz
New Member

To answer your second question, each distinct file can have a different set of field names. Part of the field name will be the hostname of the machine writing it; if there are, say, 50 VMs running, there may be 100-1,000 metrics for each of those VMs. Under each field there will be data, but a 15-minute run can only have 750 samples, so you can generate very wide but somewhat shallow CSV files.


l0r3zz
New Member

This is the way performance data comes out of VMware ESX hypervisors; to see how it might be used, have a look at www.durganetworks.com/esxplot. I wrote a Python program that allows the user to navigate through this "sea of data". I'd like to make a Splunkable version.


Lowell
Super Champion

If not all the fields are populated for all events, it may make sense to use a Python input script to convert the CSV-style input into a list of key=value pairs instead. There are pros and cons to either approach, but it may be worth considering some kind of pre-processing. A minimal sketch of that kind of pre-processor is below.
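This is only a sketch, and it assumes the sample timestamp sits in the first column and the header row carries the (very long) list of field names; adjust it to however your esxtop files are actually laid out:

#!/usr/bin/env python
# Sketch of a pre-processing scripted input: read an esxtop-style CSV and
# emit one key=value event per sample row, keeping only populated fields,
# so Splunk indexes narrow key=value events instead of one very wide CSV row.
import csv
import sys

def csv_to_kv(csv_path):
    with open(csv_path, "r") as f:
        reader = csv.reader(f)
        header = next(reader)              # first line holds the column names
        for row in reader:
            if not row:
                continue                   # skip blank lines
            timestamp = row[0]             # assumes sample time is column one
            pairs = []
            for name, value in zip(header[1:], row[1:]):
                if value:                  # drop unpopulated fields
                    # quote the name, since esxtop names contain spaces and slashes
                    pairs.append('"%s"=%s' % (name, value))
            sys.stdout.write("%s %s\n" % (timestamp, " ".join(pairs)))

if __name__ == "__main__":
    csv_to_kv(sys.argv[1])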

l0r3zz
New Member

Well, maybe not key=value unless the value is a list of values across time, e.g. mars.edu/CPU0/%RDY will have 750 samples under it, one every 2 seconds.


araitz
Splunk Employee

This is just about the approach that our developers are using in our forthcoming VMware app (s/python/perl/g).


gkanapathy
Splunk Employee

As far as indexing the files goes, Splunk should be okay with that. All you might need is to increase the TRUNCATE setting, which cuts off lines after 50000 characters. Splunk can handle lines of a few million characters at least, though I'm not sure how the UI will do in certain browsers.
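For example, something along these lines in props.conf would raise that limit (the sourcetype name here is just a placeholder for whatever you assign to these files):

[esxtop_csv]
TRUNCATE = 500000

I believe setting TRUNCATE = 0 turns line truncation off entirely, but a large explicit limit is usually the safer choice.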

However, for pulling fields out, you can do things a couple of ways. You can either have Splunk generate the field extraction configurations from the file contents as it reads and indexes them, or you can generate them yourself in props.conf and transforms.conf.
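If you go the manual route, a delimiter-based extraction would look roughly like this; the stanza names are placeholders, and the FIELDS list would have to be generated from each file's header row rather than typed by hand:

props.conf:
[esxtop_csv]
REPORT-esxtop = esxtop_fields

transforms.conf:
[esxtop_fields]
DELIMS = ","
FIELDS = "timestamp", "mars.edu/CPU0/%RDY", ...

A small script can emit the FIELDS line from each file's header, which is why generating these configs tends to be the practical option here. Note that Splunk will typically clean characters like % and / out of extracted field names, so the names may not come out exactly as they appear in the header.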

I have no idea at what point, if any, a large number of fields (and I am suspicious about the necessity and meaningfulness of a table with purportedly 150k fields) will cause either the generation of configs or the extraction of fields using those configs to fail.

gkanapathy
Splunk Employee

Also, how many files are there? Are the files at all patterned by name, e.g., do files under a certain path share one common set of fields, while files under another path share a different set?


gkanapathy
Splunk Employee

Not to be all, whatever, but 150k field names? How might these be generated or even used? I find it hard to imagine that there could not be some normalization performed on these to make them rather more manageable and meaningful.
