
What are some best practices for dealing with complex relational database data in Splunk?

grittonc
Contributor

Splunk eats machine data for breakfast, but many of us are using data in Splunk that doesn't come from a machine and isn't easily event-ized.

What are some best practices for dealing with high-volume data from snowflake schemas? This data may change frequently, isn't broken into events, and sometimes requires complex SQL to distill it into events.

Best practices for using DB Connect are most welcome.

woodcock
Esteemed Legend

Here are my best practices for DB Connect.
Do not use v1.
Try to use v3 but expect many problems, some of them insurmountable.
Trust v2, but beware that there is a hardcoded limit that you need to fix (https://answers.splunk.com/answers/233222/splunk-db-connect-2-dbxquery-only-returns-1001-row.html).
Use checkpoints, but try not to use timestamps for this (see the rising-column sketch after this list).
Do as much work as possible in SQL (on the DB side).
Don't ingest more than you need; make sure you limit the fields returned.
If things are overly complex, consider creating a custom view inside of your DB and query against that instead of the raw table.
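As a rough sketch of the checkpoint and field-limiting points, here is the kind of rising-column query DB Connect can run. The table and column names (orders, order_id, and so on) are hypothetical; the ? placeholder is where DB Connect substitutes the last checkpoint value.

    -- Hypothetical table and columns, for illustration only.
    -- DB Connect replaces ? with the last checkpoint value on each run,
    -- so only new rows are read; select only the fields you actually need.
    SELECT order_id,        -- monotonically increasing key used as the rising column
           customer_id,
           status,
           updated_at
    FROM orders
    WHERE order_id > ?      -- checkpoint comparison
    ORDER BY order_id ASC   -- required so the checkpoint advances in order

A monotonically increasing surrogate key is a safer rising column than a timestamp, since timestamps can collide or arrive out of order.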

@SloshBurch, we need a validated_best_practice in this area.

ddrillic
Ultra Champion

-- Do as much work as possible in SQL (on the DB side).

This is huge and applies to other software integrations with DBs.

For example, if you need a certain type of data set, create a view that represents it and ingest that, instead of ingesting the raw tables and performing the joins within Splunk (see the sketch below). In Hunk, these scenarios were nightmares with huge data sets until we created the proper views.
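As a sketch of that, assuming hypothetical orders, customers, and products tables, a denormalized view lets DB Connect ingest one flat row per event instead of joining in Splunk:

    -- Hypothetical schema for illustration: one fact table, two dimensions.
    -- The join happens once, on the database side, and DB Connect simply
    -- selects from the view.
    CREATE VIEW v_order_events AS
    SELECT o.order_id,
           o.updated_at,
           o.status,
           c.customer_name,
           p.product_name
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN products  p ON p.product_id  = o.product_id;

The DB Connect input then just queries the view (for example, SELECT * FROM v_order_events WHERE order_id > ? ORDER BY order_id ASC), which keeps the Splunk side simple.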

sloshburch
Splunk Employee

You rang? lol

I guess I want to know more about the situation here. I'm not familiar enough with database data that has a changing schema. I need to appreciate that to get my head around the challenge.

grittonc
Contributor

It's not the schema that is changing, it's the data. Updates and deletes are not Splunk-friendly. If I've already indexed an event related to entity X and then something about X changes, I need to index a new event for entity X. The old one isn't relevant anymore for most purposes. That means that either users have to search for the latest version of that event, or I need to find a way to delete the old version that is out of date.

Sometimes I use lookup tables instead of indexes. I've also looked at using scheduled searches to do the heavy lifting of finding the latest version of each entity and then having dashboards use loadjob (a sketch of the lookup approach is below). But end users typing the traditional "index=foo" into the search box can easily come to incorrect conclusions.
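As a rough sketch of that lookup approach (the index, sourcetype, and field names are made up), a scheduled search can keep only the latest event per entity and overwrite a lookup that dashboards and users query instead of the raw index:

    index=foo sourcetype=dbx:orders
    | stats latest(_time) AS _time latest(status) AS status latest(updated_at) AS updated_at BY entity_id
    | outputlookup orders_current.csv

Dashboards then use | inputlookup orders_current.csv, which always reflects the most recent version of each entity.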

sloshburch
Splunk Employee

Do you retain a timestamp as a field on each row that is inserted or deleted? If you do, then DB Connect could follow a cursor on a query with ORDER BY on that timestamp field. Then each change is loaded into Splunk as a new event, and reporting on it uses latest() in a transforming statistics command (sketch below).
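A minimal sketch of that reporting side, again with made-up index and field names: every change lands as a new event, and a transforming search collapses them to the current state per entity.

    index=foo sourcetype=dbx:orders
    | stats latest(*) AS * BY entity_id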

I'm not sure at this time how to do it without that. I think your approach of using a lookup file to cache it is sound as well, but obviously it depends on the volume of data.
