All Apps and Add-ons

DBX DB Connect Issue - Duplication, Non-Historical Function, Overwrite Feature, No Timestamp, etc.

mcrawford44
Communicator

All,

Forgive the lengthy post; I'm trying to be thorough. I believe our issue is with a fundamental feature, or the intended use, of this application. It appears that DBX follows Splunk's mindset that all data must be tailing or rolling and include a timestamp. When data is "static" or consists of self-updating records, a duplicate event is created instead of the existing indexed item being updated.

Our issue arises on DB sources that don't tail or roll. Here is an example input:

[column_A] [column_B]
KEY1       DATA1

Creating a DB input with a tail won't work, as there is no rising column. You'd think that you could use '[column_A]', and it DOES append new rows. However, if any data changes on a record earlier than the current rising column value, no updates or changes occur to the existing data. For example: a 'KEY2' record is created. Splunk appends that record to the index, so you now have:

[column_A] [column_B]
KEY1       DATA1
KEY2       DATA2

These items are not time-based. If 'DATA1' changes to 'DATA1abc', even with a rising column, no records are updated in the index, because the rising column is expecting a 'KEY3'.

If we use a scheduled dump instead of a tail, duplicate data is created instead of records being updated based on a matching key. Splunk does this to account for historical data, expecting a timestamp or creating its own during a dump. This is what you get in that scenario:

[column_A] [column_B]
KEY1       DATA1
KEY2       DATA2
KEY1       DATA1abc

I know I can create a timed report that dumps the table to a CSV using outputlookup. It overwrites the CSV every run; however, the syntax for our users changes since lookups are handled differently. You also lose native cross-index query ease, and it creates more traffic on the SQL servers than needed.
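For reference, the timed report I mean looks roughly like this, scheduled to rerun periodically (the database name, table, and lookup file are placeholders, and the exact dbquery syntax may vary by DBX version):

| dbquery MyDatabase "SELECT column_A, column_B FROM my_table"
| outputlookup my_table_snapshot.csv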

My question: is there some feature or setting I'm missing that can overwrite the attached index, or update indexed events in place as they are updated on the DB side? I feel like this should be a built-in feature, as most DB connections won't be tailing logs, but rather pulling flat data from tables without timestamps, where rows update in place.

ViniciusANT
Explorer

Is there any way to force the dbmon dump to run? You could delete the indexed data and reindex by forcing the dbmon dump...

0 Karma

helge
Builder

It appears that Splunk is not well suited for data where mostly static records are updated or deleted - as far as I know there is no way to search in such a way that only the latest versions of each record are returned - excluding deleted records, of course. At least not in tail mode.

In dump mode it should work if you create the dumps at fixed points in time (e.g. each full hour). During search, all you need to do is select the time range so that it includes only the most recent dump, e.g. search from 0:55 to 1:05 to capture the 1:00 dump.
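As a relative-time sketch of that idea (the index and sourcetype are placeholders; it assumes hourly dumps that finish within a few minutes):

index=db_dump sourcetype=my_table earliest=@h-5m latest=@h+5m

This snaps to the most recent full hour and searches five minutes either side of it, which is the generic form of the 0:55 to 1:05 window.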

Of course, continuously dumping mostly unchanged data into Splunk is extremely inefficient and unnecessarily creates a huge data volume.

helge
Builder

Having an object audit trail is certainly a nice thing to have. It requires two things, though: 1) a unique ID per object that never gets reused, and 2) a dedicated field to mark an object as deleted.

0 Karma

mcrawford44
Communicator

The 'dedup' command returns the latest matching event based on event creation time, or one can specify:

stats latest(custom_timestamp)
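As a rough sketch of both forms against the example above (the index, sourcetype, and field names are placeholders):

index=db_dump sourcetype=my_table
| dedup column_A

index=db_dump sourcetype=my_table
| stats latest(column_B) AS column_B latest(_time) AS last_seen BY column_A

Either one collapses the history down to the most recent version of each key.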

The more I work with Splunk, the more the value of historical data becomes apparent. You can create queries in SPL that show the life of a record, similar to audit trails.
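For example, the full history of a single key (again a sketch with placeholder index and sourcetype names):

index=db_dump sourcetype=my_table column_A=KEY1
| sort _time
| table _time column_B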

It's really a drastic change in concept from self-updating records, but we've been able to work around it by creating DBX inputs in a specific format that takes advantage of the historical references.

Currently you can mark events for deletion, but doing so does not recover disk space. That would be a nice feature.

0 Karma

araitz
Splunk Employee
Splunk Employee

Since Splunk is a reverse time-series index, every event that Splunk indexes is given a time stamp. If Splunk cannot find a timestamp in the event itself, it assigns a timestamp of the current system time on the indexer where the event is being indexed.

You can create a batch or "dump" input that dumps the entire table to the index of your choice, similar to a tailing input but without the rising column:

http://docs.splunk.com/Documentation/DBX/latest/DeployDBX/Configuredatabasemonitoring

Every time the dump runs, the events will be assigned a timestamp corresponding with the time that the dump occurred.
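A dump input stanza looks roughly like the following (written from memory of DB Connect 1.x inputs.conf, so the stanza form and key names may differ in your version - the linked documentation is authoritative; the database, table, and index names are placeholders):

[dbmon-dump://MyDatabase/my_table_dump]
query = SELECT column_A, column_B FROM my_table
interval = 1h
index = db_dump
sourcetype = my_table
output.format = kv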

If you had an indexed column in your table that contained the last modification time for the row in question, you could of course use that as your rising column.

Updating in place would involve a pretty complex script that compares the current values in the table to the most recently indexed values in Splunk for the tuple in question. This seems painstaking and error-prone to me.

Another option to consider would be a trigger on the database side that sends events to Splunk (or perhaps to a flat file that Splunk is monitoring) as the tuples update.

mcrawford44
Communicator

What I'm aiming for is ease of use for end users, limited SQL traffic, lower index space requirements, and lower license usage.

With lookups, the data isn't as live as it could be, and the syntax is different. While that works for advanced users, I'd rather have the option to compare snapshots of tables via 'index=foo index=bar', etc.
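Something like this rough sketch is what I have in mind, assuming each index holds one snapshot of the table (the field names are from my example above):

(index=foo) OR (index=bar)
| stats values(column_B) AS column_B dc(column_B) AS versions BY column_A
| where versions > 1

That would surface every key whose value differs between the two snapshots.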

We could do this all from SQL scripts, sure, but I'd like live, up-to-date DB reports generated in Splunk, without duplicate data, without having to delete records, and without having to update report queries every run.

0 Karma

araitz
Splunk Employee
Splunk Employee

Yes, lookups do not have to be time-series in nature, since at their simplest they simply decorate or enrich existing events or results. Temporal lookups are possible, though, and might be another area for you to explore: http://docs.splunk.com/Documentation/Splunk/6.0.1/Knowledge/Usefieldlookupstoaddinformationtoyoureve...

araitz
Splunk Employee
Splunk Employee

You are probably not the only shop in the world with that use case. However, the vast majority of customers I have talked to are using DB Connect to get data from Splunk to an RDB, from an RDB to Splunk, or as a way to enrich data in Splunk with data from a DB.

To restate my point from earlier, you could certainly script something to meet your goals, but I'm not sure I understand what you would hope to gain from this use case. Splunk is designed to store and search time-series data. Maybe if I understood how you intend to search the data in Splunk, I could better address it.

mcrawford44
Communicator

I don't mind that answer, but I think it's a limitation that can easily be overcome. I'm not exactly asking for the world here; it's a data source. Excluding non-timed events seems like a completely self-imposed limitation. Same with events that don't update.

CSV lookups and SQL queries are possible without timestamps, but indexes are not? I don't buy it. You're missing an entire selling point: simple indexed reporting on databases that you can cross-compare with other inputs.

0 Karma

mcrawford44
Communicator

I understand the indexes cannot be updated, but events can be marked for removal and excluded from searches and queries.

I really find it difficult to believe that we are the only shop with database connections that aren't tailing logs.

As it stands, the functionality of DBX is limited to sequential event logs stored in DBs that never update, only append. Anything else, where records are static or updated in place, cannot be indexed without complication.

0 Karma

araitz
Splunk Employee
Splunk Employee

I'm not sure I was able to accurately convey what I meant with regard to updating in place. Each row in Splunk's indexes is essentially "WORM" (write once, read many). The automation would involve a scripted input that compares Splunk's most recent events to what is currently in the DB table.

I guess the bottom line is that there is not a way to do this with DB Connect out of the box. If other folks are interested in doing something like this, I haven't heard from them yet.

0 Karma

mcrawford44
Communicator

Using the DB to trigger and push data to Splunk butts up against the exact same issue. Nothing would change; you'd still get duplicates without the ability to update in place.

A flat-file lookup such as a CSV uses a different syntax, is not indexed, and is just a superfluous step since you're constantly dumping from the DB anyway. I'd just use the native dbquery command in DBX.

I don't mean to be rude, but you're either not understanding the issue or misunderstanding the available options.

0 Karma

mcrawford44
Communicator

Understood, but you run into the same issues as the other answer. One would have to edit queries after every dump to only reference the most recent timestamps, negating any automation.

Updating in place is not complex in the slightest. DBX could easily include a field to use as a primary key and update matching records. SQL's UPDATE command does this, so Python would clearly have a native record-set update function, or one could easily be built.

richgalloway
SplunkTrust
SplunkTrust

One cannot change, modify, alter, edit, overwrite, or update data once it is in Splunk.

You should run DB Connect in dump mode and modify your searches to extract the most recent data.

---
If this reply helps you, Karma would be appreciated.
0 Karma

jcoates_splunk
Splunk Employee
Splunk Employee

A rising column entry per job? You would probably need a stored procedure to make that happen.

0 Karma

mcrawford44
Communicator

Except in any large data set. The timestamp Splunk appends is per record, so throughout a single dump there are several variations of the stamp.

So stats latest(_time) would not display all results of a dump, only the last grouped set.

0 Karma

richgalloway
SplunkTrust
SplunkTrust

One could use stats latest(_time) to create a generic search for the most recent data set.
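A generic form might look roughly like this (a sketch; the index and sourcetype are placeholders, and it assumes a single dump finishes within five minutes):

index=db_dump sourcetype=my_table
| eventstats max(_time) AS last_dump_time
| where _time >= relative_time(last_dump_time, "-5m")

The eventstats/where pair keeps every event whose timestamp falls within the window of the most recent dump, so the search does not need editing after each run.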

---
If this reply helps you, Karma would be appreciated.
0 Karma

mcrawford44
Communicator

This would completely negate automated reports, as with every single dump, one would have to change the query to match the range of Splunk's generated timestamps per event.

Even if you cannot update existing records, you CAN mark them for deletion. So programmatically the application could delete a matching row and recreate it with the new data. The caveat is that Splunk does not release disk space for marked events, but that's easy to deal with.
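As a rough illustration with the example above, removing the superseded KEY1 event before re-indexing the new value would look something like this (the index and sourcetype are placeholders, and the search must be run by a user with the can_delete role):

index=db_dump sourcetype=my_table column_A=KEY1
| delete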

0 Karma