Getting Data In

Importing logs from Parquet into Splunk

w344423
Explorer

Hi guys, I am performing a POC to import our Parquet files into Splunk. I have managed to write a Python script that extracts the events (i.e., the raw logs) into a DataFrame.

I also wrote a Python script to pump the logs via the syslog protocol to a HF and then on to the indexer. I am using the syslog method because I have many log types, and with [udp://<portnumber>] inputs I can ingest multiple types of logs at once, each to a different sourcetype.
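
For reference, the multi-input setup I mean looks roughly like this in inputs.conf on the HF (the ports and sourcetype names here are just examples):

[udp://1025]
sourcetype = ngips_syslog
connection_host = ip

[udp://1026]
sourcetype = other_app_syslog
connection_host = ip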

However, when I do this I am not able to retain the original datetime of the raw event; Splunk takes the datetime of the point at which I sent the event. Secondly, I am using Python because all these Parquet files are stored in an S3 bucket, so it is easier for me to loop through the directory and extract the files.
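
The S3 looping part looks roughly like this (bucket name and prefix are placeholders):

import io
import boto3
import pandas as pd

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-log-bucket", Prefix="parquet/"):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket="my-log-bucket", Key=obj["Key"])["Body"].read()
        df = pd.read_parquet(io.BytesIO(body))  # requires pyarrow or fastparquet
        # ... df["event"] then gets pushed out via syslog, as in the script below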

I was hoping someone could help me out: how can I get the original timestamp of the logs? Or is there a more effective way of doing this?

Sample log from Splunk after indexing:

- Nov 10 09:45:50 127.0.0.1 <190>2023-09-01T16:59:12Z server1 server2 %NGIPS-6-430002: DeviceUUID: xxx-xxx-xxx
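
As you can see, the original 2023-09-01T16:59:12Z is still present in the raw event, but the indexed time comes from the outer syslog header (Nov 10 09:45:50) added at send time. I understand a props.conf timestamp override on the HF/indexer along these lines might pull out the embedded value, but I haven't confirmed it (sourcetype name assumed):

[ngips_syslog]
TIME_PREFIX = <\d+>
TIME_FORMAT = %Y-%m-%dT%H:%M:%SZ
MAX_TIMESTAMP_LOOKAHEAD = 30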

Here's my code to push the events via syslog.

import logging
import logging.handlers
import socket
from IPython.display import clear_output


# Create your logger. Please note that this logger is different from ArcSight Logger.
my_loggerudp = logging.getLogger('MyLoggerUDP')
#my_loggertcp = logging.getLogger('MyLoggerTCP')

# We will pass the messages at INFO level
my_loggerudp.setLevel(logging.INFO)

# Define the SyslogHandler

# TCP
#handlertcp = logging.handlers.SysLogHandler(address=('localhost', 1026), socktype=socket.SOCK_STREAM)

# UDP
handlerudp = logging.handlers.SysLogHandler(address=('localhost', 1025), socktype=socket.SOCK_DGRAM)

# X.X.X.X = IP address of the syslog collector (Connector Appliance, Logger, etc.)
# 514 = syslog port; specify the port you have defined (by default it is 514 for syslog)
my_loggerudp.addHandler(handlerudp)
#my_loggertcp.addHandler(handlertcp)

# Example: we will pass values from a list (the "event" column of the DataFrame)
event = df["event"]
count = len(event)
#for x in range(2):
for x in event:
    clear_output(wait=True)
    my_loggerudp.info(x)
    my_loggerudp.handlers[0].flush()
    count -= 1
    print(f"logs left to transmit: {count}")
    print(x)

 


richgalloway
SplunkTrust

IMO, syslog should be the onboarding choice of last resort.  There are too many syslog "standards", and issues always arise (like yours).

Since you're building your own ingestion program, consider sending the data to Splunk using HTTP Event Collector (HEC).  See "To Add Data Directly to an Index" at https://dev.splunk.com/enterprise/docs/devtools/python/sdk-python/howtousesplunkpython/howtogetdatap...
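
A minimal sketch of that approach, using plain requests rather than the SDK route the link describes (URL, token, and sourcetype below are placeholders):

import datetime
import requests

HEC_URL = "https://your-splunk-host:8088/services/collector/event"  # placeholder host
HEC_TOKEN = "00000000-0000-0000-0000-000000000000"                  # placeholder token

def send_event(raw_event, sourcetype, iso_timestamp):
    # Convert the event's own ISO 8601 timestamp to epoch seconds.
    # fromisoformat() wants "+00:00" rather than a trailing "Z" on older Pythons.
    epoch = datetime.datetime.fromisoformat(
        iso_timestamp.replace("Z", "+00:00")
    ).timestamp()
    payload = {
        "time": epoch,               # original event time, not ingest time
        "sourcetype": sourcetype,
        "event": raw_event,
    }
    resp = requests.post(
        HEC_URL,
        headers={"Authorization": f"Splunk {HEC_TOKEN}"},
        json=payload,
        verify=False,  # only acceptable for a POC with self-signed certs
    )
    resp.raise_for_status()

# e.g. send_event("%NGIPS-6-430002: DeviceUUID: ...", "ngips_syslog", "2023-09-01T16:59:12Z")

The key point is HEC's explicit time field: the event is indexed at that epoch time instead of its arrival time, which sidesteps the syslog header problem entirely.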

---
If this reply helps you, Karma would be appreciated.