The SQL editor has the following:
SELECT
srp_ordernum,
srp_source,
srp_hof_seqno,
srp_seqno_in_hand_off,
srp_entry_type,
srp_entry_type_desc,
srp_date,
srp_initials,
srp_surname,
srp_location,
srp_extn,
srp_text,
ROW_NUMBER()
OVER (ORDER BY srp_date, srp_ordernum, srp_seqno_in_hand_off, srp_hof_seqno) Sequence_no
FROM sword.service_request_progress
WHERE srp_date > '01-Aug-2018'
ORDER BY srp_date
The SQL works in Batch mode, but when I switch to Rising Column mode I get the error java.sql.SQLException: Invalid column index
DBX 3.1.3 instructs me to use Sequence_no in the WHERE clause, but when I do I get the following error:
java.sql.SQLSyntaxErrorException: ORA-00904: "SEQUENCE_NO": invalid identifier
ROW_NUMBER() is computed at query runtime, so its value is renumbered from 1 on every execution and cannot serve as a persistent rising column checkpoint. If you only ever insert rows, never update or delete them, then you can just use srp_date as your rising column.
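As for the ORA-00904 error: SQL evaluates the WHERE clause before the SELECT list, so an alias defined next to ROW_NUMBER() is not visible as a filter in the same query block. Wrapping the query in an inline view makes the alias referenceable. A sketch against the table shown above (the "> 100" threshold is just a placeholder); note that even with this workaround the numbering still restarts on every run, so it does not fix the rising-column problem:

```sql
-- Sketch: expose Sequence_no to a WHERE clause via an inline view.
-- The numbering is recomputed from 1 on every execution, so this
-- does NOT make Sequence_no a usable rising-column checkpoint.
SELECT *
FROM (
    SELECT srp_ordernum,
           srp_date,
           ROW_NUMBER() OVER (
               ORDER BY srp_date, srp_ordernum,
                        srp_seqno_in_hand_off, srp_hof_seqno
           ) AS Sequence_no
    FROM sword.service_request_progress
    WHERE srp_date > DATE '2018-08-01'
)
WHERE Sequence_no > 100  -- alias is legal here, in the outer block
ORDER BY Sequence_no
```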
I had thought about that, but SRP_DATE is only accurate to the minute and there could be thousands of entries in a minute. This led me to assume that there would be duplications in the ingestion every time the cron triggered. I will give it a go and let you know the outcome.
I have just checked the data in Splunk for yesterday; unfortunately the DBX input is missing 10-15% of events when I compare it to the source table on the SQL server. Any other suggestions?
The only way I know of would be to use >= instead of > so you won't miss events. The problem now is that you will have duplicate events. If you have some columns that can identify a unique record, you can just use "dedup" in SPL.
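Before relying on dedup, it is worth confirming that some column combination really is unique in the source table. A sketch, assuming (srp_ordernum, srp_hof_seqno, srp_seqno_in_hand_off) is the candidate natural key (that key is a guess from the column names, not confirmed by the schema):

```sql
-- Returns any key combinations that appear more than once;
-- zero rows back means the combination is safe to dedup on.
SELECT srp_ordernum,
       srp_hof_seqno,
       srp_seqno_in_hand_off,
       COUNT(*) AS dup_count
FROM sword.service_request_progress
GROUP BY srp_ordernum, srp_hof_seqno, srp_seqno_in_hand_off
HAVING COUNT(*) > 1
```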
I've thought about this problem...a lot.
Note that timestamps are not ideal for rising columns, though they are often the best available choice. Using a timestamp as a rising column can produce the following problem conditions:
A high rate of event generation can cause data duplication or loss, because checkpointing in a stream of timestamp-distinguished records assumes there is never more than one row created in a given time. If you set the time to a one second level of resolution and get five records per second, you lose or duplicate four records from every run.
Clock skew, NTP skew corrections, physical moves between timezones, and daylight savings events can cause data mis-ordering, duplication, or loss. If the skew is towards the future, then the resulting checkpoint value may temporarily or permanently stop data collection.
Non-numeric datetime values cannot be evaluated numerically, and lexical sorting can produce unpredictable results. If time series data is ordered lexically, then the resulting checkpoint value may temporarily or permanently stop data collection.
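One way around all three problems is to give the table a genuinely monotonic numeric column and use that as the rising column instead of srp_date. A sketch in Oracle DDL, assuming you are allowed to alter the table; the column, sequence, and trigger names here are invented for illustration:

```sql
-- Hypothetical surrogate rising column: strictly increasing,
-- numeric, and independent of clock skew or timestamp resolution.
ALTER TABLE sword.service_request_progress ADD (srp_rising_id NUMBER);

CREATE SEQUENCE srp_rising_seq;

CREATE OR REPLACE TRIGGER srp_rising_trg
BEFORE INSERT ON sword.service_request_progress
FOR EACH ROW
BEGIN
    :NEW.srp_rising_id := srp_rising_seq.NEXTVAL;
END;
```

With srp_rising_id as the rising column, a strict > comparison misses nothing and duplicates nothing, because no two rows can share a value.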
I have changed my SQL to read:
FROM sword.service_request_progress
WHERE srp_date >= ?
ORDER BY srp_date ASC
and performed a dedup in the search
However, I am still facing a large number of missing events. I cannot understand why the ROW_NUMBER() function worked on the old version of DBX but not the new one; it seems a backward step if you cannot add calculated fields to your data at this stage.
Any other ideas welcome.
Where does srp_date get set? Is it a database trigger or in some other app? Can events come into the db out of order?
You can change your order by to be the same as the order by in the row_number() analytic function if that does in fact give a definitive sort order:
ORDER BY srp_date,srp_ordernum,srp_seqno_in_hand_off,srp_hof_seqno
However, I don't see how that would materially change the behavior in terms of losing events.
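To answer the out-of-order question empirically: Oracle's ORA_ROWSCN pseudocolumn gives an approximate commit SCN per row (block-granular unless the table was created with ROWDEPENDENCIES, and SCN_TO_TIMESTAMP only resolves recent SCNs). Comparing commit order against srp_date order shows whether rows are committed later than their timestamp suggests, which is exactly the case a timestamp rising column will skip. A sketch:

```sql
-- Rows whose commit order disagrees with their srp_date order are
-- candidates for being skipped by a timestamp-based rising column.
SELECT srp_ordernum,
       srp_date,
       ORA_ROWSCN,
       SCN_TO_TIMESTAMP(ORA_ROWSCN) AS approx_commit_time
FROM sword.service_request_progress
WHERE srp_date > SYSDATE - 1
ORDER BY ORA_ROWSCN
```

If approx_commit_time runs well behind srp_date order, rows are arriving out of order and the >= checkpoint on srp_date will lose whatever commits after the checkpoint has already moved past its timestamp.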