Sunday, December 12, 2021

notes/links about log collection, storage, and searching

Introduction

Just some notes about log collection, storage, and searching.

I just want to be able to store some log data for a long time and search it later, once in a while. I'm not trying to produce reports from the data, do alerting, or transport the logs securely.

One of my use cases is collecting network data, storing it for a long time, and maybe later searching for a specific domain or IP that could have been related to a security incident.

Same for incoming HTTP traffic. I'd like to be able to see if someone tried to access a specific URI a really long time ago (maybe before a vuln related to that URI was public).

(leaving out elasticsearch-based things, splunk, and cloud-based services)

These notes/links should help w/ research if anyone else is trying to do the same thing as me.


Gathering & shipping logs:

For Windows Event Logs:

- fluentbit - https://docs.fluentbit.io/manual/pipeline/inputs/windows-event-log

- fluentd - https://docs.fluentd.org/input/windows_eventlog

- nxlog - https://nxlog.co/docs/nxlog-ce/nxlog-reference-manual.html#im_msvistalog

- winlogbeat - https://www.elastic.co/downloads/beats/winlogbeat-oss

- promtail - https://grafana.com/docs/loki/latest/clients/promtail/scraping/#windows-event-log


- Windows event forwarding - https://docs.microsoft.com/en-us/windows/security/threat-protection/use-windows-event-forwarding-to-assist-in-intrusion-detection - WEF sends logs from all the hosts to one collector host

For other text file based logs (linux, webapp, etc.):

- all the tools above

- vector - https://vector.dev/components/

- filebeat - https://www.elastic.co/downloads/beats/filebeat-oss

- rsyslog - https://www.rsyslog.com/

- syslog-ng - https://www.syslog-ng.com/products/open-source-log-management/

- logstash - https://www.elastic.co/downloads/logstash-oss


Some of the tools listed above can take in forwarded events (syslog, logstash/beats, etc.) from other products and tools as well.
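
As a rough illustration of the "forwarded events" idea, here's a minimal sketch of a Python app sending its log lines over syslog (UDP) to a collector. The host, port, and logger name are placeholders I made up; point them at whatever syslog input your collector actually exposes.

```python
import logging
import logging.handlers


def make_syslog_logger(host, port=514, name="webapp"):
    """Return a logger that forwards records to a syslog collector over UDP.

    host, port, and name are hypothetical values for illustration.
    Calling this twice with the same name would attach a second handler.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # SysLogHandler uses UDP by default when given a (host, port) address.
    handler = logging.handlers.SysLogHandler(address=(host, port))
    # Prefix each message with the app name so the collector can tell sources apart.
    handler.setFormatter(logging.Formatter(name + ": %(message)s"))
    logger.addHandler(handler)
    return logger
```

Usage would be something like `make_syslog_logger("192.0.2.10").info("login failed for admin")`, with a syslog input listening on the other end.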


- kafka - https://kafka.apache.org/ - another option for just getting logs from various sources and forwarding them to some other place


input/output, sources/sinks:

- kafka - https://cwiki.apache.org/confluence/display/KAFKA/Ecosystem

- vector - https://vector.dev/components/

- fluentbit - https://docs.fluentbit.io/manual/pipeline/inputs 

https://docs.fluentbit.io/manual/pipeline/outputs

- fluentd - https://docs.fluentd.org/input

https://docs.fluentd.org/output

- logstash - https://www.elastic.co/guide/en/logstash/current/input-plugins.html

https://www.elastic.co/guide/en/logstash/current/output-plugins.html

- rsyslog - https://www.rsyslog.com/plugins/


Log processing:

You may want to process the data to drop certain events or append data to some events. For example, for network data, you may want to use a filter that adds geoip info. You may also want to rename fields.
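
To make the drop/append/rename idea concrete, here's a minimal sketch that works on JSON lines. The field names, the "drop debug events" rule, and the tiny lookup table standing in for a real geoip database are all made up for illustration; a real pipeline would use the filter/transform feature of whichever tool you pick.

```python
import json
from typing import Iterable, Iterator, Optional

# Hypothetical lookup table standing in for a real GeoIP database.
FAKE_GEOIP = {"203.0.113.7": "US", "198.51.100.9": "DE"}

# Hypothetical field renames: old name -> new name.
RENAMES = {"src": "source_ip", "dst": "dest_ip"}


def process_event(event: dict) -> Optional[dict]:
    """Return the transformed event, or None if it should be dropped."""
    # Drop: discard noisy events (here, anything at debug level).
    if event.get("level") == "debug":
        return None
    # Rename: copy every field, swapping in the new names where defined.
    out = {RENAMES.get(key, key): value for key, value in event.items()}
    # Enrich: append a country code when the source IP is in the table.
    country = FAKE_GEOIP.get(out.get("source_ip"))
    if country is not None:
        out["source_country"] = country
    return out


def process_lines(lines: Iterable[str]) -> Iterator[str]:
    """Read JSON lines, apply process_event, and yield JSON lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        result = process_event(json.loads(line))
        if result is not None:
            yield json.dumps(result)
```

Since `process_lines` takes and returns plain lines, it could sit between two steps of a pipeline (read from a file or stdin, write to the next stage).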

Many of the collectors and shippers listed above already have some ability to modify or parse the log data. 

Some of the tools call these plugins/modules filters, processors, or transforms. You may also be able to write your own plugins or some code (some of the tools above support Lua) to change the logs before the output step happens.

Depending on the type of processing you want to do, you may need to output the logs in a format that your application understands, process them there, and then put them back into the pipeline for the next step or for storage.

For kafka, I found faust (https://faust.readthedocs.io/en/latest/), but there are other libraries too, for Python and other languages.


Log storage:

The output part in almost all the tools listed above can send data to various places where logs can be indexed and/or stored.

You can always store logs to disk on one host w/ compression (obviously searching this is not very fun). Files can also be stored in the cloud. Pretty much everything has s3 output support.

For files stored on disk, many of the tools will allow you to select a format such as text, json, etc.

Tools such as logrotate can be used to move, compress, or delete the logs (https://linux.die.net/man/8/logrotate)

Cron jobs/scheduled tasks and some scripts can always be used to move, compress, or delete files as well.
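
Here's a minimal sketch of the kind of script a cron job/scheduled task could run. The `*.log` naming and the 1 day / 365 day thresholds are assumptions for the example, not recommendations, and it overwrites an existing `.gz` with the same name.

```python
import gzip
import os
import shutil
import time
from pathlib import Path

SECONDS_PER_DAY = 86400


def rotate_logs(log_dir, compress_after_days=1, delete_after_days=365):
    """Gzip old *.log files and delete *.log.gz files past retention.

    Returns (compressed_names, deleted_names).
    """
    now = time.time()
    directory = Path(log_dir)
    compressed, deleted = [], []

    # Delete archives that are past the retention period.
    for archive in sorted(directory.glob("*.log.gz")):
        if now - archive.stat().st_mtime >= delete_after_days * SECONDS_PER_DAY:
            archive.unlink()
            deleted.append(archive.name)

    # Compress plain logs that have not been modified recently.
    for log in sorted(directory.glob("*.log")):
        mtime = log.stat().st_mtime
        if now - mtime >= compress_after_days * SECONDS_PER_DAY:
            target = log.with_name(log.name + ".gz")
            with log.open("rb") as src, gzip.open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
            # Keep the original modification time so retention still works later.
            os.utime(target, (mtime, mtime))
            log.unlink()
            compressed.append(log.name)

    return compressed, deleted
```

Deleting before compressing means a file is never compressed and removed in the same run, which felt like the safer order for a sketch.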

For being able to easily store and search logs, there is Grafana Loki - https://www.boredhackerblog.info/2021/11/collecting-unifi-logs-with-vector-and.html

Grafana Loki is somewhat similar to elasticsearch or splunk, and you can use the Grafana web UI to query the data.

While doing more research, I came across clickhouse (https://clickhouse.com/), which is also supported by some of the tools above. Clickhouse can store JSON data and you can run SQL queries on that data.

I also came across cloki, which is using clickhouse but emulating loki (https://github.com/lmangani/cloki)

The backend is a clickhouse database and you push logs into the loki emulator, just like you'd push logs into loki. cloki also supports the same query language as loki and will work with the grafana loki connector.
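
For reference, pushing logs into loki over HTTP looks roughly like the sketch below. It uses loki's JSON push endpoint (`/loki/api/v1/push`); the base URL, the labels, and the helper names are placeholders of mine, and I'm assuming cloki accepts the same payload since it emulates loki. In practice one of the shippers above would do this for you.

```python
import json
import time
import urllib.request


def build_push_payload(labels, lines, timestamp_ns=None):
    """Build the JSON body for a loki push request.

    labels: dict of stream labels, e.g. {"job": "unifi"} (made-up example)
    lines:  list of log lines to send under those labels
    """
    if timestamp_ns is None:
        timestamp_ns = time.time_ns()
    return {
        "streams": [
            {
                "stream": dict(labels),
                # Each entry is [nanosecond timestamp as a string, log line].
                # Add 1 ns per line so entries keep their order.
                "values": [
                    [str(timestamp_ns + offset), line]
                    for offset, line in enumerate(lines)
                ],
            }
        ]
    }


def push(base_url, payload):
    """POST the payload to the push endpoint and return the HTTP status code."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/loki/api/v1/push",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Usage would be something like `push("http://localhost:3100", build_push_payload({"job": "test"}, ["hello"]))`, where the URL is wherever your loki (or loki emulator) is listening.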


Log search:

Searching the logs obviously depends on how they're stored. For uncompressed or compressed logs, tools such as grep, zgrep, or ripgrep (https://github.com/BurntSushi/ripgrep) can be used for searching.
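
If you'd rather stay in Python (or don't have those tools handy), here's a minimal sketch of a zgrep-style search over a directory holding both plain and gzip-compressed logs. It only does plain substring matching and assumes compressed files end in `.gz`.

```python
import gzip
from pathlib import Path


def search_logs(log_dir, needle):
    """Yield (file name, line number, line) for every line containing needle."""
    for path in sorted(Path(log_dir).iterdir()):
        if not path.is_file():
            continue
        # Pick the opener by extension; gzip.open in "rt" mode decodes to text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if needle in line:
                    yield path.name, number, line.rstrip("\n")
```

For example, `for hit in search_logs("/var/log/archive", "/admin"): print(hit)` (the path is just an example) would print every line that mentions that URI.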

On Windows, there are a few tools that can be used to search and/or query logs. Fileseek (https://www.fileseek.ca/) can be used to search a bunch of files. There is Logfusion (https://www.logfusion.ca/) as well which can be used to read log files.

There is also Log Parser Lizard (https://lizard-labs.com/log_parser_lizard.aspx) which can be used to query log files and even save queries and produce charts or reports.

Files can also be loaded into Python w/ pandas for simple or complex searches or statistical analysis. Pandas supports loading various file types (https://pandas.pydata.org/docs/reference/io.html).
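
A minimal sketch of the pandas approach, assuming the logs are stored as JSON lines. The sample events and the field names (`ts`, `client`, `uri`, `status`) are made up for the example.

```python
import io

import pandas as pd

# Made-up sample events in JSON lines format.
SAMPLE = """\
{"ts": "2021-12-01T10:00:00Z", "client": "203.0.113.7", "uri": "/index.html", "status": 200}
{"ts": "2021-12-01T10:00:05Z", "client": "198.51.100.9", "uri": "/admin/login", "status": 401}
{"ts": "2021-12-01T10:00:09Z", "client": "198.51.100.9", "uri": "/admin/setup", "status": 404}
{"ts": "2021-12-01T10:01:00Z", "client": "203.0.113.7", "uri": "/about", "status": 200}
"""


def load_events(source):
    """Load JSON lines from a path or file-like object into a DataFrame."""
    df = pd.read_json(source, lines=True)
    # Parse timestamps so they can be filtered and sorted as dates.
    df["ts"] = pd.to_datetime(df["ts"], utc=True)
    return df


def find_uri(df, fragment):
    """Return the rows whose URI contains the given plain-text fragment."""
    return df[df["uri"].str.contains(fragment, regex=False)]


if __name__ == "__main__":
    events = load_events(io.StringIO(SAMPLE))
    print(find_uri(events, "/admin"))
    # Simple statistic: number of requests per client.
    print(events["client"].value_counts())
```

With a real file you'd pass a path to `load_events` instead of the `StringIO` object.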

Finally, if you end up using loki or cloki, grafana can be used to do queries. Grafana also has connectors/plugins for other database/log storage systems. 


Sample logs:

To play with any of the tools above without making changes in production env, you can use sample logs or data sources.

https://github.com/logpai/loghub - github repo that links to several sample logs

https://www.secrepo.com/ - logs related to security. there are some network traffic logs in there

https://www.sec.gov/dera/data/edgar-log-file-data-set.html - EDGAR log files

https://log-sharing.dreamhosters.com/ - various log files

https://www.logs.to/ - log generator (various types)

https://github.com/mingrammer/flog - log generator

https://certstream.calidog.io/ - certificate transparency logs

http://www.hivemq.com/demos/websocket-client/ / broker.mqttdashboard.com - If you want to grab MQTT demo data. I'm pretty sure people are using this for free for their projects too...



ps: I'm not an engineer or an observability expert. The implementation of the tools above varies and may have an impact on resource usage.