Parse Nginx Logs Using SPL Regular Expressions

Nginx access logs record the details of user requests, so parsing them is valuable for business operations. This topic describes how to parse Nginx access logs with regular expression functions.

Parse standard Nginx logs

Simple Log Service supports parsing Nginx logs with SPL regular expressions. The following example uses a successful Nginx access log entry to show how.

  • Raw log

    __source__:  192.168.0.1
    __tag__:__client_ip__:  192.168.254.254
    __tag__:__receive_time__:  1563443076
    content: 192.168.0.2 - - [04/Jan/2019:16:06:38 +0800] "GET http://example.aliyundoc.com/_astats?application=&inf.name=eth0 HTTP/1.1" 200 273932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.example.com/bot.html)"
  • Parsing requirements

    • Requirement 1: extract the ip, datetime, verb, request, protocol, code, sendbytes, refere, and useragent fields from the Nginx log.

    • Requirement 2: further parse request to extract uri_proto, uri_domain, and uri_param.

    • Requirement 3: further parse uri_param to extract uri_path and uri_query.

  • SLS SPL statements

    • Complete pipeline

      * | parse-regexp content, '(\d+\.\d+\.\d+\.\d+) - - \[([\s\S]+)\] \"([A-Z]+) ([\S]*) ([\S]+)["] (\d+) (\d+) ["]([\S]*)["] ["]([\S\s]+)["]' as ip, datetime,verb,request,protocol,code,sendbytes,refere,useragent
        | parse-regexp request, '^(\w+):\/\/([^\/]+)(\/.*)$' as uri_proto, uri_domain, uri_param
        | parse-regexp uri_param, '([^?]*)\?(.*)' as uri_path, uri_query
    • Step-by-step pipelines and results

      • The pipeline for Requirement 1, parsing the Nginx log, is as follows.

        * | parse-regexp content, '(\d+\.\d+\.\d+\.\d+) - - \[([\s\S]+)\] \"([A-Z]+) ([\S]*) ([\S]+)["] (\d+) (\d+) ["]([\S]*)["] ["]([\S\s]+)["]' as ip, datetime,verb,request,protocol,code,sendbytes,refere,useragent

        Result:

        __source__:  192.168.0.1
        __tag__:__receive_time__:  1563443076
        code:  200
        content:  192.168.0.2 - - [04/Jan/2019:16:06:38 +0800] "GET http://example.aliyundoc.com/_astats?application=&inf.name=eth0 HTTP/1.1" 200 273932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.example.com/bot.html)"
        datetime:  04/Jan/2019:16:06:38 +0800
        ip:  192.168.0.2
        protocol:  HTTP/1.1
        refere:  -
        request:  http://example.aliyundoc.com/_astats?application=&inf.name=eth0
        sendbytes:  273932
        useragent:  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.example.com/bot.html)
        verb:  GET
    • The pipeline for Requirement 2, parsing request, is as follows.

      * | parse-regexp request, '^(\w+):\/\/([^\/]+)(\/.*)$' as uri_proto, uri_domain, uri_param

      Result:

      uri_param: /_astats?application=&inf.name=eth0
      uri_domain: example.aliyundoc.com
      uri_proto: http
    • The pipeline for Requirement 3, parsing uri_param, is as follows.

      * | parse-regexp uri_param, '([^?]*)\?(.*)' as uri_path, uri_query

      Result:

      uri_path: /_astats
      uri_query: application=&inf.name=eth0
  • Final SPL result

    __source__:  192.168.0.1
    __tag__:__receive_time__:  1563443076
    code:  200
    content:  192.168.0.2 - - [04/Jan/2019:16:06:38 +0800] "GET http://example.aliyundoc.com/_astats?application=&inf.name=eth0 HTTP/1.1" 200 273932 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.example.com/bot.html)"
    datetime:  04/Jan/2019:16:06:38 +0800
    ip:  192.168.0.2
    protocol:  HTTP/1.1
    refere:  -
    request:  http://example.aliyundoc.com/_astats?application=&inf.name=eth0
    sendbytes:  273932
    uri_domain:  example.aliyundoc.com
    uri_proto:  http
    uri_param: /_astats?application=&inf.name=eth0
    uri_path: /_astats
    uri_query: application=&inf.name=eth0
    useragent:  Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.example.com/bot.html)
    verb:  GET
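The three-stage pipeline above can be checked locally before deploying it, for example with Python's re module. The sketch below uses the same regular expressions as the SPL statements, with single-character classes such as `[\S]` and `["]` written in their equivalent short forms `\S` and `"`:

```python
import re

content = ('192.168.0.2 - - [04/Jan/2019:16:06:38 +0800] '
           '"GET http://example.aliyundoc.com/_astats?application=&inf.name=eth0 HTTP/1.1" '
           '200 273932 "-" '
           '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.example.com/bot.html)"')

# Stage 1: split the whole access-log line into its fields,
# in the same order as the SPL "as" clause.
ip, datetime, verb, request, protocol, code, sendbytes, refere, useragent = re.search(
    r'(\d+\.\d+\.\d+\.\d+) - - \[([\s\S]+)\] "([A-Z]+) (\S*) (\S+)" '
    r'(\d+) (\d+) "(\S*)" "([\s\S]+)"',
    content).groups()

# Stage 2: split request into scheme, domain, and path+query.
uri_proto, uri_domain, uri_param = re.search(
    r'^(\w+)://([^/]+)(/.*)$', request).groups()

# Stage 3: split path+query at the first '?'.
uri_path, uri_query = re.search(r'([^?]*)\?(.*)', uri_param).groups()
```

Running this against the sample log reproduces the field values shown in the final result above.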

Parse non-standard Nginx logs

Scenario 1: extract key fields from the middle of a log

Use parse-regexp with a regular expression to extract the Time, Level, Server, and Info values from the middle of the message field.

  • Example

    • Raw log

      {"message": "[2024-10-11 10:30:34.917962]\t[info]\t[SingleWorldService]\t[ResourceManager:testOut for 2, srvClusterId=1009]\t[[]     ...ewEntities/ResourceServiceComponent/ResourceManager.out:190]"}
    • SPL statement

      *| parse-regexp message, '\[([^[\]]+)\]\s+\[([^[\]]+)\]\s+\[([^[\]]+)\]\s+\[([^[\]]+)\]' as Time,Level,Server,Info
    • Result

      Time:2024-10-11 10:30:34.917962
      Level:info
      Server:SingleWorldService
      Info:ResourceManager:testOut for 2, srvClusterId=1009
      message:[2024-10-11 10:30:34.917962]	[info]	[SingleWorldService]	[ResourceManager:testOut for 2, srvClusterId=1009]	[[]     ...ewEntities/ResourceServiceComponent/ResourceManager.out:190]
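The bracket-delimited extraction above can also be verified locally with the same pattern; a minimal Python sketch using the sample message:

```python
import re

message = ('[2024-10-11 10:30:34.917962]\t[info]\t[SingleWorldService]\t'
           '[ResourceManager:testOut for 2, srvClusterId=1009]\t'
           '[[]     ...ewEntities/ResourceServiceComponent/ResourceManager.out:190]')

# [^[\]]+ matches a run of characters containing no square brackets, so each
# of the first four [...] blocks yields exactly one capture group; the fifth
# block starts with "[[" and therefore cannot match.
pattern = r'\[([^[\]]+)\]\s+\[([^[\]]+)\]\s+\[([^[\]]+)\]\s+\[([^[\]]+)\]'
time, level, server, info = re.search(pattern, message).groups()
```

The `\s+` between the bracket groups covers the tab separators in the raw log.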

Scenario 2: extract specific values from a log with a regular expression

Use parse-regexp with a regular expression to extract the RequestTime, traceId, ThreadName, LogLevel, ClassName, LineNum, and LogInfo values from the content field.

  • Example

    • Raw log

      {"content":"2023-11-11 14:47:17.844 [12] [backup-test-thread] INFO com.shidsds.dus.service.BackTestService 109 | 备份缓存 1021 秒前已刷新,本次跳过:backupCache:com.shidsds.dus.service.DuuewwService:lastRefreshTime"}
    • SPL statement

      *| parse-regexp content, '([\d\-]{10}\s+[\d:\.]{12})\s+\[([^[\]]+)\]\s+\[([^[\]]+)\]\s+([\S]+)\s+([\S]+)\s+([\d]+)\s+\|\s+(.*)' as RequestTime,traceId,ThreadName,LogLevel,ClassName,LineNum,LogInfo
    • Result

      ClassName:com.shidsds.dus.service.BackTestService
      LineNum:109
      LogInfo:备份缓存 1021 秒前已刷新,本次跳过:backupCache:com.shidsds.dus.service.DuuewwService:lastRefreshTime
      LogLevel:INFO
      RequestTime:2023-11-11 14:47:17.844
      ThreadName:backup-test-thread
      content:2023-11-11 14:47:17.844 [12] [backup-test-thread] INFO com.shidsds.dus.service.BackTestService 109 | 备份缓存 1021 秒前已刷新,本次跳过:backupCache:com.shidsds.dus.service.DuuewwService:lastRefreshTime
      traceId:12
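This pattern, too, can be re-checked locally; a minimal Python sketch with the same regular expression (single-character classes `[\S]` and `[\d]` written as `\S` and `\d`):

```python
import re

content = ('2023-11-11 14:47:17.844 [12] [backup-test-thread] INFO '
           'com.shidsds.dus.service.BackTestService 109 | '
           '备份缓存 1021 秒前已刷新,本次跳过:backupCache:'
           'com.shidsds.dus.service.DuuewwService:lastRefreshTime')

# [\d\-]{10} matches the date, [\d:\.]{12} the time of day; the two bracketed
# groups capture traceId and ThreadName, and (.*) takes everything after '|'.
pattern = (r'([\d\-]{10}\s+[\d:\.]{12})\s+\[([^[\]]+)\]\s+\[([^[\]]+)\]'
           r'\s+(\S+)\s+(\S+)\s+(\d+)\s+\|\s+(.*)')
(request_time, trace_id, thread_name,
 log_level, class_name, line_num, log_info) = re.search(pattern, content).groups()
```

Note that the first capture group spans both the date and the time of day, so RequestTime comes out as a single value.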