2021-12-02T14:00:00,000Z DEBUG This is
a message that
spans multiple lines
2021-12-02T14:00:01,000Z DEBUG Single-line-message
2021-12-02T14:00:02,000Z DEBUG Another message
2021-12-02T14:00:03,000Z INFO This is
another multi-line message
+-------------------------------------------------------+
| 2021-12-02T14:00:00,000Z DEBUG This is                |
+-------------------------------------------------------+
| a message that                                        |
+-------------------------------------------------------+
| spans multiple lines                                  |
+-------------------------------------------------------+
| 2021-12-02T14:00:01,000Z DEBUG Single-line-message    |
+-------------------------------------------------------+
| 2021-12-02T14:00:02,000Z DEBUG Another message        |
+-------------------------------------------------------+
| 2021-12-02T14:00:03,000Z INFO This is                 |
+-------------------------------------------------------+
| another multi-line message                            |
+-------------------------------------------------------+
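This one-record-per-line result is what a plain line-based read gives us. As a minimal sketch (using the same placeholder bucket as below, not taken from the original post):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# Every physical line becomes its own row, so the multi-line
# messages above are torn into separate, context-free records.
naive_df = spark.read.text("s3a://<my-log-bucket>/")
naive_df.show(truncate=False)
To keep multi-line messages together, we instead load each file as a whole and apply our own splitting logic: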
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
import pyspark.sql.functions as f
# Change this to your data source
S3_INPUT_PATH = "s3a://<my-log-bucket>/"
SC = SparkContext.getOrCreate()
SPARK = SparkSession(SC)
# Load all files as individual records, i.e. each
# record has the path as _1 and the content as _2
logs_df = SC.wholeTextFiles(S3_INPUT_PATH).toDF()
_1 contains the path to the file and _2 its content. (Note: I'd avoid printing the column _2 in Jupyter notebooks; in most cases the content will be too much to handle.) This is important, because treating the file as a whole allows us to use our own splitting logic to separate the individual log records.
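If you want to sanity-check what was loaded without dumping whole files, a small inspection like this is usually enough (a sketch, not from the original post):
# Show each file's path and content size instead of the raw content of _2.
logs_df.select("_1", f.length("_2").alias("content_length")).show(truncate=False)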
We can use the split function in combination with the explode function like this:
multiline_str_df = logs_df.select(
    f.explode(
        f.split("_2", r"(?=\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}[\S\s]*\t)")
    ).alias("value")
)
You may wonder about the trailing \t and the ?= in the split pattern. The problem with a normal split is that the delimiter itself is thrown away: splitting 11A11A11 at A would yield [11, 11, 11] and we'd lose the A. If we now split at the timestamp, we'd lose it, which is not good. This is where regular expressions can help. They allow for a look-ahead match. The details don't really matter, but if you start a group with ?=, it matches the position just before the pattern without consuming it. By using a look-ahead group, we split right before every timestamp (that is eventually followed by a tab) while keeping the timestamp itself in the result:
(?=\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3}[\S\s]*\t)
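The effect is easy to reproduce in plain Python (Spark's split uses Java regular expressions, but zero-width look-aheads behave the same way here); this small sketch is an illustration, not part of the original post:
import re

# Splitting at the delimiter itself drops it ...
print(re.split(r"A", "11A11A11"))      # ['11', '11', '11']
# ... while splitting at a look-ahead keeps the delimiter attached
# to the chunk that follows it (Python 3.7+ allows zero-width splits).
print(re.split(r"(?=A)", "11A11A11"))  # ['11', 'A11', 'A11']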
multiline_str_df looks roughly like this:
+-------------------------------------------------------+
| value                                                 |
+-------------------------------------------------------+
| 2021-12-02T14:00:00,000Z DEBUG This is                |
| a message that                                        |
| spans multiple lines                                  |
+-------------------------------------------------------+
| 2021-12-02T14:00:01,000Z DEBUG Single-line-message    |
+-------------------------------------------------------+
| 2021-12-02T14:00:02,000Z DEBUG Another message        |
+-------------------------------------------------------+
| 2021-12-02T14:00:03,000Z INFO This is                 |
| another multi-line message                            |
+-------------------------------------------------------+
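Now that each row holds one complete log record, we can pull the individual fields out of it with regexp_extract: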
REGEX_PATTERN = r'(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2},\d{3})[\S\s]*\t([\S\s]*?)\s*\t([\s\S]*)'
# 1: Timestamp
# 2: Log Level
# 3: Message
log_data_df = multiline_str_df.select(
    f.regexp_extract('value', REGEX_PATTERN, 1).alias('timestamp'),
    f.regexp_extract('value', REGEX_PATTERN, 2).alias('log_level'),
    f.regexp_extract('value', REGEX_PATTERN, 3).alias('message'),
)
log_data_df will look like this and you can do further processing based on that:
+----------------------------------------------------------------+
| timestamp                | log_level | message                 |
+----------------------------------------------------------------+
| 2021-12-02T14:00:00,000Z | DEBUG     | This is                 |
|                          |           | a message that          |
|                          |           | spans multiple lines    |
+----------------------------------------------------------------+
| 2021-12-02T14:00:01,000Z | DEBUG     | Single-line-message     |
+----------------------------------------------------------------+
| 2021-12-02T14:00:02,000Z | DEBUG     | Another message         |
+----------------------------------------------------------------+
| 2021-12-02T14:00:03,000Z | INFO      | This is                 |
|                          |           | another multi-line[...] |
+----------------------------------------------------------------+
Note that the extraction pattern reuses the same timestamp expression as the split pattern, just without the look-ahead (?=). I recommend you use something like regex101.com to tinker with your regular expression until it does what you want.
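As one example of the further processing mentioned above (a sketch, not part of the original walkthrough), you could aggregate the parsed records by log level:
# Count how many records were logged at each level.
log_data_df.groupBy("log_level").count().show()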