I've got an order log that includes two fields, order_id and shipped_date. I am playing with Splunk to see how hard it is to generate dashboards that give me visibility into order processing, so I'd like to be able to count the number of orders shipped each day.
I have the search working after I created a custom field using a regex. The problem is that when I run the process that generates the log, the log (postgres database) can end up with duplicate entries. For example, it might add a record for order_id "12345" with a shipped_date "12/1/2017" if I run the process multiple times. It doesn't affect my overall processes, but for dashboard generation, this gives a very incorrect view of shipments.
So my question is, is this something that can be taken care of by search easily, or is this something where I have to complicate the DB Connect source by doing it all in SQL? I think this means that my DB Connect source wouldn't be able to digest the entire DB table, but instead I'd have to tailor it to a specific use case. In this case, that would mean something like only returning unique results.
`| dedup order_id` will keep the first record returned by the search for each value of `order_id`, and discard all subsequently encountered events with the same value of `order_id`.

Splunk in general returns the most recent events first, so unless you are manipulating the order, or you are comparing records that somehow got no `_time` or got the same `_time`, the record retained will be the most recent one.
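As a minimal sketch of how `dedup` could feed a daily shipment count (the `index` and `sourcetype` names here are made up; the `order_id` and `shipped_date` fields are the ones from the question):

```spl
index=orders sourcetype=order_log
| dedup order_id
| stats count AS shipped_orders BY shipped_date
| sort shipped_date
```

Putting `dedup` before `stats` means each order is counted at most once, no matter how many duplicate rows made it into the database.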
You can use `dedup` to remove duplicates, OR you can use the `dc` (distinct count) function instead of a simple `count` in your chart/stats/timechart. I'm guessing your query will be along the lines of `... | stats dc(order_id) AS shipped_orders BY shipped_date`.
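The same distinct-count idea works on a timeline with `timechart`, as a sketch (this assumes `_time` reflects the shipment date, which may not hold if `shipped_date` is a separate extracted field):

```spl
... | timechart span=1d dc(order_id) AS shipped_orders
```

Because `dc` counts each distinct `order_id` once per time bucket, duplicates within a day don't inflate the count, and no separate `dedup` step is needed.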
I may have just found my answer, but I'm not sure. I wonder if piping to dedup actually does what I think, e.g. `| dedup order_id`?
If you're doing DB queries using DB Connect, you'll want to see if there's something you can use as a rising column; this will prevent duplicate entries. For example, if your order numbers are unique, ascending values, you can use that as the rising column, and DB Connect won't index duplicated data.
One thing I've done where I didn't have a rising column in a database is convert timestamps to epoch times in the query syntax and use that as the rising column, but you would really need your timestamps to include milliseconds to ensure uniqueness. It just depends on your data.
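As a rough PostgreSQL sketch of that timestamp-to-epoch conversion (the table and column names here are hypothetical; adjust to your schema):

```sql
-- Hypothetical table "orders" with a timestamptz column "shipped_at".
-- EXTRACT(EPOCH FROM ...) returns seconds with a fractional part,
-- so sub-second precision is preserved if the column has it.
SELECT order_id,
       shipped_date,
       EXTRACT(EPOCH FROM shipped_at) AS rising_col
FROM   orders
ORDER  BY rising_col ASC
```

In a DB Connect rising-column input, the computed `rising_col` would be the column you point the checkpoint at, so only rows newer than the last checkpoint are fetched on each run.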
If you can't use a rising column in the database, using dedup should do the trick.
I'm already using a timestamptz field as the rising column. My understanding is that this will prevent duplicate rows from being pulled in, and it will make the updates faster as it will only read from the checkpoint. However, the order_id entries can still be duplicated. If I run my service twice, and if the query that it executes returns the same results twice, then that data will end up in my DB twice as well. I want to prevent Splunk from counting multiples of the same data.
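If deduplicating at the source ever becomes preferable, one hedged option (assuming a PostgreSQL table named `orders`, which is an invented name) is `DISTINCT ON`, which returns one row per `order_id`:

```sql
-- Hypothetical schema: keep one row per order_id, choosing the
-- earliest shipped_date among the duplicates.
SELECT DISTINCT ON (order_id) order_id, shipped_date
FROM   orders
ORDER  BY order_id, shipped_date ASC
```

The trade-off is the one raised in the question: the DB Connect source then serves this one dashboard use case rather than digesting the whole table.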
Dedup definitely returns fewer events; I just have to verify that the retained data is correct. 🙂