I have several questions about data architecture that are rooted in CIM data models and performance considerations.
Background: We ingest about 2 TB of new log data every day. Some sourcetypes receive hundreds of millions of new events per day, one receives 1.1 billion, and quite a few receive a few million. From a data architecture standpoint, we generally put events from a given log generator type into an index and sourcetype named for the technology; for example, Windows events go into index=win sourcetype=win. These are not the real names, but you get the idea.
When we evaluate the CIM data models, Windows events span a range of data models, depending on the event type.
As an example, Windows events can potentially be part of the following CIM data models (list not complete): Alerts, Application State, Authentication, Certificates, Inventory, etc.
Questions: Given that our data volumes are massive and could adversely affect the performance of any given search, wouldn't it be prudent to create a data architecture that sorts data into smaller piles, by index and sourcetype, that more closely mimic the CIM data models?
Would changing our sourcetype for Windows events from sourcetype=win to sourcetype=win-authentication, sourcetype=win-application-state, and so on have significant implications for performance, and potentially reduce the search target area of a given model from a really big 'pile' to a smaller, more specific 'pile' of event types?
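To make the idea concrete, here is a rough Python sketch of the routing I have in mind. It is only an illustration of the logic, not how Splunk would do it (in Splunk this would presumably be an index-time sourcetype override rather than external code). The EventCode-to-sourcetype mapping, the function names, and the sample events are all made up for the example; the only real part is that 4624 and 4625 are Windows logon events and 7036 is a service state change.

```python
from collections import Counter

# Hypothetical, incomplete mapping from Windows EventCode to a
# CIM-aligned sourcetype. A real mapping would be far larger.
EVENT_CODE_TO_SOURCETYPE = {
    4624: "win-authentication",     # successful logon
    4625: "win-authentication",     # failed logon
    7036: "win-application-state",  # service entered a new state
}

# Anything we have not classified stays in the generic pile.
DEFAULT_SOURCETYPE = "win"


def route_sourcetype(event_code: int) -> str:
    """Return the sourcetype an event with this EventCode would get."""
    return EVENT_CODE_TO_SOURCETYPE.get(event_code, DEFAULT_SOURCETYPE)


def pile_sizes(event_codes) -> Counter:
    """Count how many events land in each sourcetype 'pile'."""
    return Counter(route_sourcetype(code) for code in event_codes)


def scan_fraction(event_codes, sourcetype: str) -> float:
    """Fraction of all events a search limited to one sourcetype would cover."""
    sizes = pile_sizes(event_codes)
    total = sum(sizes.values())
    return sizes[sourcetype] / total if total else 0.0


if __name__ == "__main__":
    # Made-up sample of EventCodes, purely for illustration.
    sample = [4624, 4625, 4688, 7036, 4624, 1000]
    print(pile_sizes(sample))
    print(scan_fraction(sample, "win-authentication"))
```

With a split like this, a search scoped to sourcetype=win-authentication would only need to consider the share of events reported by `scan_fraction`, instead of everything in sourcetype=win. Whether that translates into real search-time savings is exactly what I am asking.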
Would such a data architecture give noticeably better performance than data model acceleration alone, would it help in addition to data model acceleration, or would it be a wash?
Does anyone else out there design their data architecture at the index and sourcetype levels because of performance concerns? If so, can you give an example of your data architecture design and ballpark volumes of data? What other considerations may have led you to that design?
Are there any flaws in this line of thinking? Is it potentially too much work to manage when contrasted with potentially small performance gains? Are the performance gains worth the overhead of setting up and maintaining the data architecture?