Quickly analyze offline access logs
After you obtain an offline log file, you can use a command-line tool to quickly parse the file and extract information such as the top 10 IP addresses by request volume, user agents, and referrers. This topic shows how to analyze CDN offline access logs with a command-line tool in a Linux environment.
Prerequisites
You have downloaded an offline access log. For details, see Quick Start.
Usage notes
Naming rule for log files: Accelerated domain name_year_month_day_start time_end time[extension field].gz. The extension field starts with an underscore (_). Example:
aliyundoc.com_2018_10_30_000000_010000_xx.gz
Note: The names of some log files do not contain an extension field. Example:
aliyundoc.com_2018_10_30_000000_010000.gz
Sample log entry
[9/Jun/2015:01:58:09 +0800] 10.10.10.10 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://example.com/robot/)" "text/html"
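The commands in this topic address log fields by position, so it can help to see how awk splits the sample entry on whitespace. The following is a minimal sketch, assuming your logs use the same layout as the sample entry; the variable names entry and fields are only illustrative.

```shell
# Sample entry from this topic, stored in a shell variable for illustration.
entry='[9/Jun/2015:01:58:09 +0800] 10.10.10.10 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://example.com/robot/)" "text/html"'

# Print the whitespace-separated fields that the later commands rely on.
# Inside the awk program, text in double quotes is a literal label.
fields=$(printf '%s\n' "$entry" | awk '{
  print "$3 (client IP): " $3
  print "$6 (referrer): " $6
  print "$8 (URL): " $8
  print "$9 (status): " $9
}')
printf '%s\n' "$fields"
```

The timestamp occupies fields 1 and 2 because it contains a space, and quoted values keep their quotation marks, which is why the later commands compare the referrer with "-" including the quotes.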
Parse the logs
Collect and prepare log data
- See Quick Start to obtain the log file aliyundoc.com_2018_10_30_000000_010000.gz.
- Upload the log file to a local Linux server.
- Log on to the local Linux server and run the following command to decompress the log file.
gzip -d aliyundoc.com_2018_10_30_000000_010000.gz
Decompression creates a log file named aliyundoc.com_2018_10_30_000000_010000.
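If you would rather keep the compressed file, the log can also be streamed straight into an analysis pipeline. The following is a minimal sketch, assuming gzip is installed; the demo file and its three entries are hypothetical data created only for this example.

```shell
# Build a small gzip-compressed demo log in a temporary directory (hypothetical data).
tmpdir=$(mktemp -d)
demo="$tmpdir/demo_log"
printf '%s\n' \
  '[30/Oct/2018:00:00:01 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/a.png" 200 100 200 HIT "Mozilla/5.0" "image/png"' \
  '[30/Oct/2018:00:00:02 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/b.png" 200 100 200 HIT "Mozilla/5.0" "image/png"' \
  '[30/Oct/2018:00:00:03 +0800] 192.0.2.2 - 11 "-" "GET http://aliyundoc.com/a.png" 404 100 200 MISS "curl/7.61.0" "text/html"' \
  > "$demo"
gzip "$demo"    # produces $demo.gz and removes the uncompressed file

# gzip -cd writes the decompressed data to stdout and leaves the .gz file in place.
line_count=$(gzip -cd "$demo.gz" | wc -l | tr -d ' ')
top_ip=$(gzip -cd "$demo.gz" | awk '{print $3}' | sort | uniq -c | sort -nr | head -n 1)
echo "lines: $line_count"
echo "top IP: $top_ip"

rm -rf "$tmpdir"
```

Streaming avoids keeping a second, uncompressed copy on disk, at the cost of decompressing the file again for every command.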
Identify and filter abnormal behavior
Check request volume
Analyze the request volume from each IP address in the offline access log to identify abnormal behavior.
- Abnormally high request volume: Request volume from a single source IP address is significantly higher than the baseline, which can indicate traffic abuse.
- Large volume of requests in a short period: A sudden traffic spike or an unusual, cyclical request pattern.
List the top 10 IP addresses by request volume.
cat [$Log_Txt] | awk '{print $3}' | sort | uniq -c | sort -nr | head -n 10
Note:
- awk '{print $3}': Extracts the third column of the log file, which is the IP address. Columns are separated by spaces.
- sort: Sorts the IP addresses so that identical addresses are adjacent, which uniq requires.
- uniq -c: Counts the occurrences of each IP address.
- sort -nr: Sorts the results by count in descending order.
- head -n 10: Retrieves the top 10 IP addresses with the most requests.
- [$Log_Txt]: Replace with the name of your log file, for example, aliyundoc.com_xxxxxxx.
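To look for a large volume of requests in a short period, requests can be grouped by minute as well as by IP address. The following is a minimal sketch, assuming the timestamp is the first field and has the form shown in the sample entry; the demo log is hypothetical data.

```shell
# Hypothetical demo log: three requests in minute 01:58 and one in 01:59.
log=$(mktemp)
printf '%s\n' \
  '[9/Jun/2015:01:58:09 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/a" 200 1 1 HIT "Mozilla/5.0" "text/html"' \
  '[9/Jun/2015:01:58:10 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/a" 200 1 1 HIT "Mozilla/5.0" "text/html"' \
  '[9/Jun/2015:01:58:11 +0800] 192.0.2.2 - 10 "-" "GET http://aliyundoc.com/a" 200 1 1 HIT "Mozilla/5.0" "text/html"' \
  '[9/Jun/2015:01:59:01 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/a" 200 1 1 HIT "Mozilla/5.0" "text/html"' \
  > "$log"

# Strip the leading "[" and the trailing ":SS" from the timestamp to get a
# per-minute key, then count requests per (minute, IP) pair, busiest first.
per_minute=$(awk '{
  t = $1
  sub(/^\[/, "", t)
  sub(/:[0-9][0-9]$/, "", t)
  print t, $3
}' "$log" | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$per_minute"

rm -f "$log"
```

Each output line shows a count, a minute, and an IP address. What counts as a spike depends on your own traffic baseline, so no threshold is applied here.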
User agent analysis
Analyze the user agent in the offline access log to identify abnormal requests. Abnormal user agents typically have the following characteristics:
- Abnormal or forged user agent: Many scraping tools use default or forged user agents. You can find these by filtering for uncommon, suspicious, or empty user agents.
Extract and count user agents that start with Mozilla.
grep -o '"Mozilla[^"]*' [$Log_Txt] | cut -d'"' -f2 | sort | uniq -c | sort -nr | head -n 10
You can filter out common user agents to identify suspicious ones.
grep -v -E "Firefox|Chrome|Safari|Edge" [$Log_Txt]
Count the number of requests whose user agent is empty or does not contain Mozilla.
awk '!/Mozilla/' [$Log_Txt] | wc -l
Note:
- grep -o: Outputs only the matched content.
- grep -v -E: Inverts the match to show lines that do not match the specified patterns.
- wc -l: Counts the number of lines.
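User agents that do not start with Mozilla can be counted by splitting each entry on double quotes instead of spaces. The following is a minimal sketch, assuming the layout of the sample entry (referrer, request, and user agent are the first three quoted values and contain no embedded double quotes) and assuming that an empty user agent is logged as "-" or as an empty string; the demo log is hypothetical data.

```shell
# Hypothetical demo log with a browser, a command-line client, and an empty user agent.
log=$(mktemp)
printf '%s\n' \
  '[9/Jun/2015:01:58:09 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/a" 200 1 1 HIT "Mozilla/5.0 (X11; Linux x86_64)" "text/html"' \
  '[9/Jun/2015:01:58:10 +0800] 192.0.2.2 - 10 "-" "GET http://aliyundoc.com/a" 200 1 1 HIT "curl/7.61.0" "text/html"' \
  '[9/Jun/2015:01:58:11 +0800] 192.0.2.2 - 10 "-" "GET http://aliyundoc.com/b" 200 1 1 HIT "curl/7.61.0" "text/html"' \
  '[9/Jun/2015:01:58:12 +0800] 192.0.2.3 - 10 "-" "GET http://aliyundoc.com/a" 200 1 1 HIT "-" "text/html"' \
  > "$log"

# With a double quote as the field separator, field 2 is the referrer,
# field 4 is the request, and field 6 is the user agent.
ua_counts=$(awk -F'"' '{print $6}' "$log" | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$ua_counts"

# Count entries whose user agent is "-" or an empty string (assumed representations).
empty_ua=$(awk -F'"' '$6 == "-" || $6 == "" {n++} END {print n+0}' "$log")
echo "empty user agents: $empty_ua"

rm -f "$log"
```

Because the whole quoted value is printed, user agents that contain spaces stay intact, which the space-separated field numbers cannot guarantee.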
Request pattern analysis
Analyze the requested URLs in the offline access log to identify abnormal request patterns. Abnormal URL requests typically have the following characteristics:
- High URL similarity: Traffic scraping often involves a large number of requests to similar or identical URLs. Analyzing URL patterns can help you detect these abnormal requests.
- High request ratio for specific resource types: An unusually high number of requests for specific resources, such as images, CSS, and JS files, may indicate traffic scraping.
List the top 10 URLs by request volume.
awk '{print $8}' [$Log_Txt] | tr -d '"' | sort | uniq -c | sort -nr | head -n 10
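To check whether specific resource types dominate the traffic, requests can be counted by file extension. The following is a minimal sketch, assuming the requested URL is the eighth space-separated field as in the sample entry; the demo log is hypothetical data, and URLs without a file extension are grouped under the illustrative label (none).

```shell
# Hypothetical demo log: three PNG requests, one CSS request, one request for "/".
log=$(mktemp)
printf '%s\n' \
  '[9/Jun/2015:01:58:09 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/img/a.png" 200 1 1 HIT "Mozilla/5.0" "image/png"' \
  '[9/Jun/2015:01:58:10 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/img/b.png?x=1" 200 1 1 HIT "Mozilla/5.0" "image/png"' \
  '[9/Jun/2015:01:58:11 +0800] 192.0.2.2 - 10 "-" "GET http://aliyundoc.com/img/c.PNG" 200 1 1 HIT "Mozilla/5.0" "image/png"' \
  '[9/Jun/2015:01:58:12 +0800] 192.0.2.2 - 10 "-" "GET http://aliyundoc.com/style.css" 200 1 1 HIT "Mozilla/5.0" "text/css"' \
  '[9/Jun/2015:01:58:13 +0800] 192.0.2.3 - 10 "-" "GET http://aliyundoc.com/" 200 1 1 HIT "Mozilla/5.0" "text/html"' \
  > "$log"

# Count requests per file extension of the requested URL (field 8).
ext_counts=$(awk '{
  url = $8
  gsub(/"/, "", url)        # remove the closing quote of the request field
  sub(/\?.*$/, "", url)     # ignore the query string
  n = split(url, parts, "/")
  ext = "(none)"
  # parts[1..3] are "http:", "" and the host, so a path needs at least 4 parts.
  if (n >= 4 && parts[n] ~ /\./) {
    ext = tolower(parts[n])
    sub(/^.*\./, "", ext)   # keep only the text after the last dot
  }
  count[ext]++
} END {
  for (e in count) print count[e], e
}' "$log" | sort -nr)
printf '%s\n' "$ext_counts"

rm -f "$log"
```

Comparing these counts with the usual mix for your site shows whether one resource type, such as images, receives an unusually large share of requests.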
Status code analysis
Analyze the status codes in the offline access log to identify abnormal requests. Abnormal status codes typically have the following characteristics:
- High ratio of 4xx or 5xx responses: A high number of 4xx or 5xx status codes from a single IP address may indicate malicious crawling attempts.
Count the occurrences of different status codes.
awk '{print $9}' [$Log_Txt] | sort | uniq -c | sort -nr
List the top 10 IP addresses that generated 400 status codes. Matching on the status code field ($9) prevents other numeric fields that happen to equal 400 from being counted.
awk '$9 == 400 {print $3}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
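Because the indicator is a ratio rather than a raw count, it can be useful to compute the share of 4xx and 5xx responses per IP address. The following is a minimal sketch, assuming the IP address is field 3 and the status code is field 9 as in the sample entry; the demo log is hypothetical data.

```shell
# Hypothetical demo log: 192.0.2.1 has 1 error in 4 requests, 192.0.2.2 has 3 in 4.
log=$(mktemp)
printf '%s\n' \
  '[9/Jun/2015:01:58:01 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/a" 200 1 1 HIT "Mozilla/5.0" "text/html"' \
  '[9/Jun/2015:01:58:02 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/b" 200 1 1 HIT "Mozilla/5.0" "text/html"' \
  '[9/Jun/2015:01:58:03 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/c" 404 1 1 MISS "Mozilla/5.0" "text/html"' \
  '[9/Jun/2015:01:58:04 +0800] 192.0.2.1 - 10 "-" "GET http://aliyundoc.com/d" 200 1 1 HIT "Mozilla/5.0" "text/html"' \
  '[9/Jun/2015:01:58:05 +0800] 192.0.2.2 - 10 "-" "GET http://aliyundoc.com/x" 404 1 1 MISS "curl/7.61.0" "text/html"' \
  '[9/Jun/2015:01:58:06 +0800] 192.0.2.2 - 10 "-" "GET http://aliyundoc.com/y" 403 1 1 MISS "curl/7.61.0" "text/html"' \
  '[9/Jun/2015:01:58:07 +0800] 192.0.2.2 - 10 "-" "GET http://aliyundoc.com/z" 500 1 1 MISS "curl/7.61.0" "text/html"' \
  '[9/Jun/2015:01:58:08 +0800] 192.0.2.2 - 10 "-" "GET http://aliyundoc.com/a" 200 1 1 HIT "curl/7.61.0" "text/html"' \
  > "$log"

# Output columns: IP, total requests, 4xx/5xx requests, error percentage.
error_ratio=$(awk '{
  total[$3]++
  if ($9 ~ /^[45][0-9][0-9]$/) errors[$3]++
} END {
  for (ip in total)
    printf "%s %d %d %.1f\n", ip, total[ip], errors[ip], 100 * errors[ip] / total[ip]
}' "$log" | sort -k4,4nr | head -n 10)
printf '%s\n' "$error_ratio"

rm -f "$log"
```

IP addresses with only a handful of requests can show extreme percentages, so it is reasonable to read the ratio together with the total request count in the second column.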
Referrer analysis
You can identify the source of abnormal traffic by analyzing the Referer field in the offline access log. When your accelerated domain name experiences an unusual traffic surge, referrer analysis helps determine whether the traffic is from legitimate sources, hotlinking, or malicious attacks. Abnormal referrer values typically have the following characteristics:
- High percentage of requests with an empty referrer: A large number of requests with an empty referrer may indicate direct access or scraping by tools. Under normal circumstances, requests for static resources like images and videos usually include a referrer from the source page. If traffic from requests with an empty referrer is significantly high, you should investigate these requests.
List the top 10 referrers by request volume.
awk '{print $6}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
Count requests with an empty referrer.
awk '$6=="\"-\"" {count++} END {print count+0}' [$Log_Txt]
For empty referrer requests, list the top 10 IP addresses by request volume.
awk '$6=="\"-\"" {print $3}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
Note:
- $6 refers to the sixth field in the log entry, which corresponds to the referrer value. An empty referrer is represented as "-" in the log.
- If requests with an empty referrer generate a large amount of traffic, use the CDN hotlink protection feature to configure an allowlist or denylist.
- Concentration of abnormal referrer sources: If the top referrers do not belong to your business domains, other sites may be hotlinking your resources.
Filter requests from a specific referrer and count the corresponding IP addresses. The match is limited to the referrer field ($6) so that other fields, such as the user agent, do not produce false matches.
awk 'index($6, "example.com") {print $3}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
Note: Replace example.com with the suspicious referrer domain. This command identifies which IP addresses use that referrer to hotlink your resources.
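To see at a glance which referrers do not belong to your business domains, the referrer values can be reduced to host names and your own domain filtered out. The following is a minimal sketch, assuming the referrer is field 6 as in the sample entry; own_domain, hotlink.example, and other.example are hypothetical names, and the demo log is hypothetical data.

```shell
own_domain="aliyundoc.com"   # hypothetical: replace with your own business domain

# Hypothetical demo log with own-domain, foreign, and empty referrers.
log=$(mktemp)
printf '%s\n' \
  '[9/Jun/2015:01:58:01 +0800] 192.0.2.1 - 10 "http://aliyundoc.com/page.html" "GET http://aliyundoc.com/a.png" 200 1 1 HIT "Mozilla/5.0" "image/png"' \
  '[9/Jun/2015:01:58:02 +0800] 192.0.2.2 - 10 "http://hotlink.example/gallery" "GET http://aliyundoc.com/a.png" 200 1 1 HIT "Mozilla/5.0" "image/png"' \
  '[9/Jun/2015:01:58:03 +0800] 192.0.2.3 - 10 "https://hotlink.example/list" "GET http://aliyundoc.com/a.png" 200 1 1 HIT "Mozilla/5.0" "image/png"' \
  '[9/Jun/2015:01:58:04 +0800] 192.0.2.4 - 10 "-" "GET http://aliyundoc.com/a.png" 200 1 1 HIT "Mozilla/5.0" "image/png"' \
  '[9/Jun/2015:01:58:05 +0800] 192.0.2.5 - 10 "http://www.aliyundoc.com/x" "GET http://aliyundoc.com/a.png" 200 1 1 HIT "Mozilla/5.0" "image/png"' \
  '[9/Jun/2015:01:58:06 +0800] 192.0.2.6 - 10 "http://other.example:8080/" "GET http://aliyundoc.com/a.png" 200 1 1 HIT "Mozilla/5.0" "image/png"' \
  > "$log"

# Reduce each referrer to its host, skip empty referrers and the own domain
# (including its subdomains), and count the remaining hosts.
foreign=$(awk -v own="$own_domain" '{
  ref = $6
  gsub(/"/, "", ref)
  if (ref == "-" || ref == "") next
  i = index(ref, "://")
  if (i > 0) ref = substr(ref, i + 3)   # strip the scheme
  split(ref, parts, "/")                # parts[1] is host[:port]
  host = tolower(parts[1])
  sub(/:.*$/, "", host)                 # strip the port
  if (host == own) next
  if (length(host) > length(own) && substr(host, length(host) - length(own)) == ("." own)) next
  print host
}' "$log" | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$foreign"

rm -f "$log"
```

Hosts that appear near the top of this list are candidates for the suspicious referrer domain used in the filter command above.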