Splunk Search

Counting regex matches only if another field is unique

splunknoob408
Explorer

I've got an ordering log that includes two fields, order_id and shipped_date. I'm experimenting with Splunk to see how hard it is to build dashboards that give me visibility into order processing, and to start I'd like to be able to count the number of orders shipped each day.

I have the search working after creating a custom field with a regex. The problem is that when I run the process that generates the log, the log (a Postgres database) can end up with duplicate entries. For example, it might add a record for order_id "12345" with a shipped_date of "12/1/2017" if I run the process multiple times. It doesn't affect the process itself, but for dashboard generation it gives a very misleading view of shipments.
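
Something along these lines is roughly what I have now (the index, sourcetype, and regex here are just placeholders; the real ones are specific to my data):

index=order_log sourcetype=postgres:orders
| rex field=_raw "order_id=(?<order_id>\d+).*shipped_date=(?<shipped_date>\S+)"
| timechart span=1d count AS orders_shipped

Because of the duplicate rows, that daily count comes out higher than the true number of shipments.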

So my question is: can this be handled easily in the search, or do I have to complicate the DB Connect source by doing it all in SQL? I think the latter would mean my DB Connect source couldn't digest the entire DB table; instead I'd have to tailor it to a specific use case, which here would mean something like only returning unique results.

1 Solution

DalJeanis
Legend

| dedup order_id will take the first record returned by the search for each value of order_id, and discard all subsequently encountered events with the same value of order_id.

Splunk generally returns the most recent events first, so unless you are re-sorting the results, or you are comparing records that somehow have no _time or share the same _time, the event that is retained will be the most recent one.
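
Putting that together with the daily count, something like this should do it (index and field names are assumed from your description, so adjust to your data):

index=order_log
| dedup order_id
| timechart span=1d count AS orders_shipped

If you want to be explicit about which copy survives, dedup also accepts a sortby clause, e.g. | dedup order_id sortby -_time keeps the newest event for each order_id.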


somesoni2
Revered Legend

You can use dedup to remove the duplicates, or you can use dc (distinct count) instead of a simple count function in your chart/stats/timechart. I'm guessing your query will be something along the lines of ... | stats dc(order_id) as shipped_order by shipped_date.
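
As a rough sketch, with the base search and field names assumed from your post:

index=order_log
| stats dc(order_id) as shipped_order by shipped_date

or, if you'd rather bucket on _time instead of the extracted date field:

index=order_log
| timechart span=1d dc(order_id) as shipped_order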

splunknoob408
Explorer

I may have just found my answer, but I'm not sure. Does piping to dedup actually do what I think, e.g. | dedup order_id?


wenthold
Communicator

If you're pulling data with DB Connect, you'll want to see if there's something you can use as a rising column, since that will prevent duplicate entries from being indexed. For example, if your order numbers are unique, ascending values, you can use that column as the rising column and DB Connect won't index duplicated data.

One thing I've done when a database didn't have an obvious rising column is to convert timestamps to epoch times in the query syntax and use that as the rising column, but your timestamps really need to include milliseconds for this to guarantee uniqueness. It just depends on your data.

If you can't use a rising column in the database, using dedup should do the trick.


splunknoob408
Explorer

I'm already using a timestamptz field as the rising column. My understanding is that this prevents the same rows from being pulled in twice and makes the updates faster, since DB Connect only reads forward from the checkpoint. However, the order_id entries can still be duplicated in the table itself: if I run my service twice and the query it executes returns the same results both times, that data ends up in my DB twice, and each duplicate arrives as a new row as far as the rising column is concerned. I want to prevent Splunk from counting multiples of the same order.

Dedup definitely returns fewer results, but I just have to verify that the data that's left is correct. 🙂
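
One way I'm thinking of checking that (just a sketch, reusing the placeholder names from my earlier post) is to compare the raw event count with the distinct order count per day:

index=order_log
| timechart span=1d count AS raw_events dc(order_id) AS unique_orders

Any day where raw_events is higher than unique_orders is a day where the duplicates would have inflated my numbers.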
