Quickly analyze offline access logs

Last updated: 2026-06-05 02:35:23

After you obtain an offline log file, you can use a command-line tool to quickly parse the file and extract information such as the top 10 IP addresses by request volume, user agents, and referrers. This topic shows how to analyze CDN offline access logs with a command-line tool in a Linux environment.

Prerequisites

You have downloaded an offline access log. For details, see Quick Start.

Usage notes

  • Naming rule for log files: Accelerated domain name_year_month_day_start time_end time[extension field].gz. The extension field starts with an underscore (_). Example: aliyundoc.com_2018_10_30_000000_010000_xx.gz.

    Note

    Some log file names do not contain an extension field. Example: aliyundoc.com_2018_10_30_000000_010000.gz.

  • Sample log entry

    [9/Jun/2015:01:58:09 +0800] 10.10.10.10 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://example.com/robot/)" "text/html"
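
The following sketch shows how the sample entry splits into the whitespace-separated fields that the commands in this topic rely on. The field positions ($3 for the client IP address, $6 for the referrer, $8 for the URL, $9 for the status code) are read off the sample entry above and are an assumption if your logs use a different layout.

```shell
# Sample entry copied from this topic; real logs may have a different layout.
entry='[9/Jun/2015:01:58:09 +0800] 10.10.10.10 - 1542 "-" "GET http://www.aliyun.com/index.html" 200 191 2830 MISS "Mozilla/5.0 (compatible; AhrefsBot/5.0; +http://example.com/robot/)" "text/html"'

# awk splits on runs of whitespace by default, so quoted values that
# contain spaces (the request, the user agent) span several fields.
fields=$(printf '%s\n' "$entry" | awk '{
    url = $8
    gsub(/"/, "", url)   # drop the closing quote of the request
    print "ip=" $3
    print "referrer=" $6
    print "url=" url
    print "status=" $9
}')
printf '%s\n' "$fields"
```

Because the request and the user agent contain spaces, fields after $9 do not map one-to-one to log values. Commands that need the full user agent are better off splitting on quotation marks instead.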

Parse the logs

Collect and prepare log data

  1. See Quick Start to obtain the log file aliyundoc.com_2018_10_30_000000_010000.gz.

  2. Upload the log file to a local Linux server.

  3. Log on to the local Linux server and run the following command to decompress the log file.

    gzip -d aliyundoc.com_2018_10_30_000000_010000.gz

    Decompression creates a log file named aliyundoc.com_2018_10_30_000000_010000.
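
If you want to keep the compressed file, one possible alternative is to stream it with gzip -dc and pipe the output straight into an analysis command. The sketch below uses a hypothetical file name (demo.log.gz) and made-up log lines in the sample format.

```shell
workdir=$(mktemp -d)

# Two made-up entries in the sample log format.
cat > "$workdir/demo.log" <<'EOF'
[30/Oct/2018:00:00:01 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/a.html" 200 191 2830 HIT "Mozilla/5.0" "text/html"
[30/Oct/2018:00:00:02 +0800] 192.0.2.2 - 15 "-" "GET http://aliyundoc.com/b.html" 200 191 1024 MISS "Mozilla/5.0" "text/html"
EOF
gzip "$workdir/demo.log"   # produces demo.log.gz and removes demo.log

# -d decompresses and -c writes to standard output, so the .gz file is left untouched.
line_count=$(gzip -dc "$workdir/demo.log.gz" | wc -l | tr -d ' ')
echo "$line_count"

rm -rf "$workdir"
```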

Identify and filter abnormal behavior

Check request volume

Analyze the request volume from each IP address in the offline access log to identify abnormal behavior.

  • Abnormally high request volume: Request volume from a single source IP address is significantly higher than the baseline, which can indicate traffic abuse.

  • Large volume of requests in a short period: A sudden traffic spike or an unusual, cyclical request pattern.

    List the top 10 IP addresses by request volume.

    awk '{print $3}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
    Note
    • awk '{print $3}': Extracts the third column of the log file, which is the IP address. Columns are separated by spaces.

    • sort: Sorts the IP addresses.

    • uniq -c: Counts the occurrences of each IP address.

    • sort -nr: Sorts the results by count in descending order.

    • head -n 10: Retrieves the top 10 IP addresses with the most requests.

    • [$Log_Txt]: Replace with the name of your log file, for example, aliyundoc.com_xxxxxxx.
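
To look for a large volume of requests in a short period, you can also count requests per IP address per minute. The following is a minimal sketch, assuming the timestamp is in $1 and the IP address is in $3 as in the sample entry. The log lines are made up for illustration.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
[30/Oct/2018:00:00:01 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/a.html" 200 191 2830 HIT "Mozilla/5.0" "text/html"
[30/Oct/2018:00:00:20 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/a.html" 200 191 2830 HIT "Mozilla/5.0" "text/html"
[30/Oct/2018:00:00:45 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/a.html" 200 191 2830 HIT "Mozilla/5.0" "text/html"
[30/Oct/2018:00:01:05 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/a.html" 200 191 2830 HIT "Mozilla/5.0" "text/html"
[30/Oct/2018:00:01:10 +0800] 192.0.2.2 - 15 "-" "GET http://aliyundoc.com/b.html" 200 191 1024 MISS "Mozilla/5.0" "text/html"
EOF

# Reduce the timestamp in $1 to minute precision, pair it with the IP
# address in $3, then count identical (minute, IP) pairs.
per_minute=$(awk '{
    minute = $1
    sub(/^\[/, "", minute)            # drop the leading bracket
    sub(/:[0-9][0-9]$/, "", minute)   # drop the seconds
    print minute, $3
}' "$log" | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$per_minute"

rm -f "$log"
```

To run this against your own data, set log to the name of the decompressed log file and remove the lines that create and delete the temporary sample.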

User agent analysis

Analyze the user agent in the offline access log to identify abnormal requests. Abnormal user agents typically have the following characteristics:

  • Abnormal or forged user agent: Many scraping tools use default or forged user agents. You can find these by filtering for uncommon, suspicious, or empty user agents.

    Extract and count the user agents that start with Mozilla.

    grep -o '"Mozilla[^"]*' [$Log_Txt] | cut -d'"' -f2 | sort | uniq -c | sort -nr | head -n 10

    You can filter out common user agents to identify suspicious ones.

    grep -v -E "Firefox|Chrome|Safari|Edge" [$Log_Txt]

    Count the log lines that do not contain Mozilla. This covers requests with an empty user agent as well as tools that send their own user agent string.

    awk '!/Mozilla/' [$Log_Txt] | wc -l
    Note

    grep -o: Outputs only the matched content.

    grep -v -E: Inverts the match to show lines that do not match the specified patterns.

    wc -l : Counts the number of lines.
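
The grep command above only sees user agents that start with Mozilla. As a complementary sketch, you can split each line on quotation marks and count every user agent value. This assumes the layout of the sample entry, in which the user agent is the third quoted value and therefore field 6 when the separator is a quotation mark. The log lines are made up.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
[30/Oct/2018:00:00:01 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/a.html" 200 191 2830 HIT "Mozilla/5.0 (X11; Linux x86_64)" "text/html"
[30/Oct/2018:00:00:02 +0800] 192.0.2.2 - 15 "-" "GET http://aliyundoc.com/b.html" 200 191 1024 MISS "curl/8.5.0" "text/html"
[30/Oct/2018:00:00:03 +0800] 192.0.2.2 - 15 "-" "GET http://aliyundoc.com/c.html" 200 191 1024 MISS "curl/8.5.0" "text/html"
EOF

# With a quotation mark as the separator, $2 is the referrer, $4 is the
# request, and $6 is the full user agent, including any spaces it contains.
ua_counts=$(awk -F'"' '{print $6}' "$log" | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$ua_counts"

rm -f "$log"
```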

Request pattern analysis

Analyze the requested URLs in the offline access log to identify abnormal request patterns. Abnormal URL requests typically have the following characteristics:

  • High URL similarity: Traffic scraping often involves a large number of requests to similar or identical URLs. Analyzing URL patterns can help you detect these abnormal requests.

  • High request ratio for specific resource types: An unusually high number of requests for specific resources, such as images, CSS, and JS files, may indicate traffic scraping.

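    List the top 10 URLs by request volume. In the sample log format, the URL is the eighth space-separated field, followed by the closing quotation mark of the request.

    awk '{print $8}' [$Log_Txt] | tr -d '"' | sort | uniq -c | sort -nr | head -n 10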
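
To check the request ratio for specific resource types, you can group requests by file extension. The following is a rough sketch, assuming the URL is in $8 as in the sample entry. The log lines and the (none) label for URLs without an extension are made up for illustration.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
[30/Oct/2018:00:00:01 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/img/a.png" 200 191 2830 HIT "Mozilla/5.0" "image/png"
[30/Oct/2018:00:00:02 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/img/b.png?v=2" 200 191 2830 HIT "Mozilla/5.0" "image/png"
[30/Oct/2018:00:00:03 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/img/c.png" 200 191 2830 HIT "Mozilla/5.0" "image/png"
[30/Oct/2018:00:00:04 +0800] 192.0.2.2 - 15 "-" "GET http://aliyundoc.com/js/app.js" 200 191 1024 MISS "Mozilla/5.0" "text/javascript"
[30/Oct/2018:00:00:05 +0800] 192.0.2.2 - 15 "-" "GET http://aliyundoc.com/" 200 191 1024 MISS "Mozilla/5.0" "text/html"
EOF

ext_counts=$(awk '{
    url = $8
    gsub(/"/, "", url)       # drop the closing quote of the request
    sub(/\?.*$/, "", url)    # drop any query string
    n = split(url, parts, "/")
    name = parts[n]          # last path segment
    # n > 3 means the URL has a path, so the host name is not mistaken for a file.
    if (n > 3 && name ~ /\./) { sub(/.*\./, "", name); print name }
    else print "(none)"
}' "$log" | sort | uniq -c | sort -nr)
printf '%s\n' "$ext_counts"

rm -f "$log"
```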

Status code analysis

Analyze the status codes in the offline access log to identify abnormal requests. Abnormal status codes typically have the following characteristics:

  • High ratio of 4xx or 5xx responses: A high number of 4xx or 5xx status codes from a single IP address may indicate malicious crawling attempts.

    Count the occurrences of different status codes.

    awk '{print $9}' [$Log_Txt] | sort | uniq -c | sort -nr

    List the top 10 IP addresses that generated 400 status codes.

    awk '$9 == 400 {print $3}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
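
To see the ratio of 4xx and 5xx responses per IP address rather than a single status code, a sketch along the following lines may help. It assumes the IP address is in $3 and the status code is in $9, as in the sample entry. The log lines and the output layout (IP address, total requests, error responses, error percentage) are made up for illustration.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
[30/Oct/2018:00:00:01 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/x1" 404 191 300 MISS "Mozilla/5.0" "text/html"
[30/Oct/2018:00:00:02 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/x2" 404 191 300 MISS "Mozilla/5.0" "text/html"
[30/Oct/2018:00:00:03 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/x3" 403 191 300 MISS "Mozilla/5.0" "text/html"
[30/Oct/2018:00:00:04 +0800] 192.0.2.1 - 12 "-" "GET http://aliyundoc.com/a.html" 200 191 2830 HIT "Mozilla/5.0" "text/html"
[30/Oct/2018:00:00:05 +0800] 192.0.2.2 - 15 "-" "GET http://aliyundoc.com/a.html" 200 191 2830 HIT "Mozilla/5.0" "text/html"
[30/Oct/2018:00:00:06 +0800] 192.0.2.2 - 15 "-" "GET http://aliyundoc.com/b.html" 200 191 1024 HIT "Mozilla/5.0" "text/html"
EOF

# Tally all requests and 4xx/5xx responses per IP address, then print
# "ip total errors percent" sorted by the error percentage.
error_ratio=$(awk '{
    total[$3]++
    if ($9 ~ /^[45][0-9][0-9]$/) errors[$3]++
}
END {
    for (ip in total)
        printf "%s %d %d %.1f\n", ip, total[ip], errors[ip], 100 * errors[ip] / total[ip]
}' "$log" | sort -k4,4nr | head -n 10)
printf '%s\n' "$error_ratio"

rm -f "$log"
```

Sorting by percentage alone can surface IP addresses with only a handful of requests, so read the total column alongside it.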

Referrer analysis

You can identify the source of abnormal traffic by analyzing the Referer field in the offline access log. When your accelerated domain name experiences an unusual traffic surge, referrer analysis helps determine whether the traffic is from legitimate sources, hotlinking, or malicious attacks. Abnormal referrer values typically have the following characteristics:

  • High percentage of requests with an empty referrer: A large number of requests with an empty referrer may indicate direct access or scraping by tools. Under normal circumstances, requests for static resources like images and videos usually include a referrer from the source page. If traffic from requests with an empty referrer is significantly high, you should investigate these requests.

    List the top 10 referrers by request volume.

    awk '{print $6}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10

    Count requests with an empty referrer.

    awk '$6=="\"-\"" {count++} END {print count+0}' [$Log_Txt]

    For empty referrer requests, list the top 10 IP addresses by request volume.

    awk '$6=="\"-\"" {print $3}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
    Note
    • $6 refers to the sixth field in the log entry, which corresponds to the referrer value. An empty referrer is represented as "-" in the log.

    • If requests with an empty referrer generate a large amount of traffic, use the CDN hotlink protection feature to configure an allowlist or denylist.

  • Concentration of abnormal referrer sources: If the top referrers do not belong to your business domains, other sites may be hotlinking your resources.

    Filter requests from a specific referrer and count the corresponding IP addresses.

    awk 'index($6, "example.com") {print $3}' [$Log_Txt] | sort | uniq -c | sort -nr | head -n 10
    Note

    Replace example.com with the suspicious referrer domain. This command identifies which IP addresses use that referrer to hotlink your resources.
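
To judge whether the top referrers belong to your business domains, it can help to group referrers by host name instead of by full URL. The following is a minimal sketch, assuming the referrer is in $6 and contains no spaces, as in the sample entry. The log lines, the host hotlink.example.net, and the (empty) label are made up for illustration.

```shell
log=$(mktemp)
cat > "$log" <<'EOF'
[30/Oct/2018:00:00:01 +0800] 192.0.2.1 - 12 "http://hotlink.example.net/page1" "GET http://aliyundoc.com/img/a.png" 200 191 2830 HIT "Mozilla/5.0" "image/png"
[30/Oct/2018:00:00:02 +0800] 192.0.2.2 - 12 "http://hotlink.example.net/page2?id=7" "GET http://aliyundoc.com/img/a.png" 200 191 2830 HIT "Mozilla/5.0" "image/png"
[30/Oct/2018:00:00:03 +0800] 192.0.2.3 - 12 "http://hotlink.example.net:8080/" "GET http://aliyundoc.com/img/a.png" 200 191 2830 HIT "Mozilla/5.0" "image/png"
[30/Oct/2018:00:00:04 +0800] 192.0.2.4 - 15 "https://aliyundoc.com/index.html" "GET http://aliyundoc.com/img/a.png" 200 191 2830 HIT "Mozilla/5.0" "image/png"
[30/Oct/2018:00:00:05 +0800] 192.0.2.5 - 15 "-" "GET http://aliyundoc.com/img/a.png" 200 191 2830 HIT "Mozilla/5.0" "image/png"
EOF

ref_hosts=$(awk '{
    ref = $6
    gsub(/"/, "", ref)               # drop the surrounding quotes
    if (ref == "-") { print "(empty)"; next }
    sub(/^https?:\/\//, "", ref)     # drop the scheme
    sub(/[\/:?].*$/, "", ref)        # drop the port, path, and query string
    print ref
}' "$log" | sort | uniq -c | sort -nr | head -n 10)
printf '%s\n' "$ref_hosts"

rm -f "$log"
```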
