阿里云内容分发网络(Content Delivery Network,CDN)是建立并覆盖在承载网之上,由分布在不同区域的边缘节点服务器群组成的分布式网络。CDN分担源站压力,避免网络拥塞,确保在不同区域、不同场景下加速网站内容的分发,提高资源访问速度。您可以将CDN日志转存到OSS,或者将CDN日志配置实时日志推送,然后进行日志分析。接下来您就可以在不移动日志文件的情况下通过DLA分析OSS中的CDN日志数据,及时发现问题,并有针对性的解决问题,提升CDN服务质量。

前提条件

通过DLA分析CDN日志数据之前,您需要先完成以下准备工作:

步骤一:将CDN日志转存到OSS

将CDN日志转存到OSS,详情请参见日志转存

步骤二:创建OSS Schema

假设OSS Bucket中存储了以下CDN日志文件。

您可以在DMS页面编写SQL创建OSS Schema、创建日志文件表以及读取日志文件数据,也可以通过MySQL客户端或者MySQL命令行工具连接DLA,然后编写SQL创建OSS Schema、创建日志文件表以及读取日志文件数据。

​CREATE SCHEMA cdn_log_schema with DBPROPERTIES(
   catalog='oss',
  location = 'oss://bucket-name/cdn/'
  );​

步骤三:创建日志文件表

CDN日志文件数据如下所示。

[9/Jun/2015:01:58:09 +0800] 188.165.15.75 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" "text/html"

日志字段含义如下表所示。

字段 说明
[9/Jun/2015:01:58:09 +0800] 时间
188.165.15.75 访问IP地址
- 代理IP
1542 responsetime(单位:ms)
"-" referer
"GET http://www.aliyun.com/index.html" request
200 httpcode
191 requestsize(单位:byte)
2830 responsesize(单位:byte)
MISS cache命中状态
"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" UA头
"text/html" 文件类型

在DLA中建表时,需要采用正则表达式对CDN日志进行解析,每个字段和对应的正则表达式如下所示。

字段 说明
[9/Jun/2015:01:58:09 +0800] (-|\\[[^\\]]*\\])
188.165.15.75 ([^ ]*)
- ([^ ]*)
1542 ([^ ]*)
"-" ([^ \"]*|\"[^\"]*\")
"GET http://www.aliyun.com/index.html" ([^ \"]*|\"[^\"]*\")
200 (-|[0-9]*)
191 (-|[0-9]*)
2830 (-|[0-9]*)
MISS ([^ ]*)
"Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://ahrefs.com/robot/)" ([^ \"]*|\"[^\"]*\")
"text/html" ([^ \"]*|\"[^\"]*\")

完整的正则表达式为:(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")

DLA中通过以下SQL创建日志文件表。

​CREATE EXTERNAL TABLE cdn_log (
  log_timestamp           VARCHAR,
  access_ip               VARCHAR,
  proxy_ip                VARCHAR,
  response_time           VARCHAR,
  referer                 VARCHAR,
  request                 VARCHAR,
  httpcode                SMALLINT,
  request_size            BIGINT,
  response_size           BIGINT,
  cache_hit_status        VARCHAR,
  ua_header               VARCHAR,
  file_type               VARCHAR
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(-|\\[[^\\]]*\\]) ([^ ]*) ([^ ]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*) (-|[0-9]*) ([^ ]*) ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\")"
)
STORED AS TEXTFILE
LOCATION 'oss://bucket-name/cdn/'
TBLPROPERTIES ('recursive.directories' = 'true');​

步骤四:查询、分析CDN日志

​SELECT * FROM cdn_log;
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| log_timestamp                | access_ip     | proxy_ip | response_time | referer | request                          | httpcode | request_size | response_size | cache_hit_status | ua_header                                                   | file_type                 |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+
| [18/Jun/2019:05:08:33 +0800] | 47.92.115.203 | -        | 777           | "-"     |                                  |      200 |          201 |          7159 | MISS             | "Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D))" | "text/html;charset=UTF-8" |
+------------------------------+---------------+----------+---------------+---------+----------------------------------+----------+--------------+---------------+------------------+-------------------------------------------------------------+---------------------------+​

可以使用DLA系统函数,对数据进行分析。

​SELECT date_parse(substring(log_timestamp, 2, length(log_timestamp) - 8), '%d/%b/%Y:%H:%i:%s') as log_timestamp,
       access_ip,
       proxy_ip,
       response_time,
       substring(referer, 2, length(referer) - 2) as referer,
       substring(request, 2, length(request) - 2) as request,
       httpcode,
       request_size,
       response_size,
       cache_hit_status,
       substring(ua_header, 2, length(ua_header) - 2) as ua_header,
       substring(file_type, 2, length(file_type) - 2) as file_type 
FROM cdn_log;
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| log_timestamp           | access_ip     | proxy_ip | response_time | referer | request                        | httpcode | request_size | response_size | cache_hit_status | ua_header                                                 | file_type               |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+
| 2019-06-18 05:08:33.000 | 47.92.115.203 | -        | 777           | -       |                                |      200 |          201 |          7159 | MISS             | Mozilla/5.0 (Linux; Android 4.1.1; Nexus 7 Build/JRO03D)) | text/html;charset=UTF-8 |
+-------------------------+---------------+----------+---------------+---------+--------------------------------+----------+--------------+---------------+------------------+-----------------------------------------------------------+-------------------------+​