Regex external table

更新时间:
复制 MD 格式

This topic shows you how to create a Regex external table to parse unstructured text data in Object Storage Service (OSS) with regular expressions. It also lists the supported data types.

Limitations

  • OSS external tables do not support the cluster property.

  • The size of a single file cannot exceed 2 GB. You must split files that are larger than 2 GB.

  • MaxCompute and OSS must be in the same region.

  • Write operations are not supported.

  • Supported data types.

Prerequisites

  • Authorization

    • You have the CreateTable permission in your MaxCompute project. For more information about table permissions, see MaxCompute permissions.

    • OSS access permissions: An Alibaba Cloud account (primary account), a RAM user, or a RAM Role can access an OSS external table. For authorization details, see STS authorization for OSS.

  • You have created a MaxCompute project.

    MaxCompute is deployed only in specific regions. To avoid potential data connectivity issues, we recommend that your OSS bucket and MaxCompute project are in the same region.

  • (Optional) You have prepared an OSS bucket, a directory, and data files. For more information, see Create a bucket, Manage directories, and Simple upload.

    You can run a single SQL statement that contains both an external table and a User-Defined Function (UDF) to read from or write to the table. Manual directory creation is still supported.

Create an external table

Syntax

For the detailed syntax of external tables in various formats, see OSS external tables.

CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'mcfed.input.regex'='<your_regex>',
    'mcfed.columns.comments'=''
)
STORED AS TEXTFILE
LOCATION '<oss_location>'
;

Common parameters

For more information, see Basic syntax parameter description.

SERDEPROPERTIES parameters

Parameter

Required

Description

mcfed.input.regex

Yes

Specifies the regular expression used to parse text data. Note the following:

  • Regex patterns such as \d must be escaped as \\d.

  • A Regex external table maps capture groups to columns. The pattern for each field must be enclosed in parentheses (). For example, use (123.*)\\t(123.*) to map two fields separated by a tab character (\t).

mcfed.columns.comments

Yes

You must add field comments when parsing text data with a regular expression.

Internal validation rule:

The number of comments specified for columns.comments must match the number of columns. For example, for four columns, you must provide four comment segments in the format a\0b\0c\0d.

Query and analyze data

Examples

Example 1: Match data starting with 123

  1. Prepare the test file regex_test1.txt and upload it to an OSS directory.

  2. Create a Regex external table to read only data that starts with 123. Other data is returned as NULL.

    CREATE EXTERNAL TABLE regex_example1 (
      id1 DOUBLE
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'mcfed.input.regex'='(123.*)',
        'mcfed.columns.comments'='a')
    STORED AS TEXTFILE
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test1/'; -- The LOCATION path must point to the parent directory of the test file.
  3. Query the Regex external table.

    SELECT * FROM regex_example1;
    
    -- The following result is returned:
    +------------+
    | id1        |
    +------------+
    | 123.0      |
    | NULL       |
    | 123.24     |
    +------------+

Example 2: Match fields starting with 123

  1. Prepare the test file regex_test2.txt and upload it to an OSS directory.

  2. Create a Regex external table to read only data where both fields start with the number 123. Other data is returned as NULL.

    CREATE EXTERNAL TABLE regex_example2 (
      id1 INT,
      id2 INT
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'mcfed.input.regex'='(123.*)\\t(123.*)',
        'mcfed.columns.comments'='a\0b'
    )
    STORED AS TEXTFILE
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test2/'; -- The LOCATION path must point to the parent directory of the test file.
  3. Query the Regex external table.

    SELECT * FROM regex_example2;
    
    -- The following result is returned:
    +------+------+
    | id1  | id2  |
    +------+------+
    | 123  | 123  |
    | NULL | NULL |
    +------+------+

Example 3: Match data across fields

  1. Prepare the test file regex_test3.txt and upload it to an OSS directory.

  2. Create a Regex external table to read only the values from the id, city_org, and ppl fields, skipping other fields.

    CREATE EXTERNAL TABLE regex_example3 (
      id INT,
      city_org STRING,
      ppl DOUBLE
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'mcfed.input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*', 
        'mcfed.columns.comments'='a\0b\0c'
    )
    STORED AS TEXTFILE
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test3/'; -- The LOCATION path must point to the parent directory of the test file.
  3. Query the Regex external table.

    SELECT * FROM regex_example3 WHERE id>100 AND id<121; 
    
    -- The following result is returned:
    +------+----------+------------+
    | id   | city_org | ppl        |
    +------+----------+------------+
    | 110  | La Coruña | 0.37       |
    | 112  | Cádiz     | 0.4        |
    | 120  | Köln      | 0.97       |
    +------+----------+------------+

Example 4: Match numeric-only data

  1. Prepare the test file regex_test4.txt and upload it to an OSS directory.

  2. Create a Regex external table to read only records where id1 is a three-digit number, id2 is any string that starts with '123', and id3 is a two-digit number.

    CREATE EXTERNAL TABLE regex_example4 (
      id1 INT,
      id2 INT,
      id3 INT
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'mcfed.input.regex'='(\\d{3}) (123.*) (\\d{2})', 
        'mcfed.columns.comments'='a\0b\0c'
    )
    STORED AS TEXTFILE
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test4/'; -- The LOCATION path must point to the parent directory of the test file.
  3. Query the Regex external table.

    SELECT * FROM regex_example4;
    
    -- The following result is returned:
    +------+------+------+
    | id1  | id2  | id3  |
    +------+------+------+
    | 123  | NULL | 12   |
    +------+------+------+

Example 5: Match dates with nested groups

  1. Prepare the test file regex_test5.txt and upload it to an OSS directory.

  2. Use the nested group ((\\d{4})-(\\d{2})-(\\d{2})) to match four columns: date, year, month, and day.

    CREATE EXTERNAL TABLE regex_example5a (
      date STRING,
      year STRING,
      month STRING,
      day STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'mcfed.input.regex'='((\\d{4})-(\\d{2})-(\\d{2}))',  
        'mcfed.columns.comments'='a\0b\0c\0d'
    )
    STORED AS TEXTFILE
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test5/'; -- The LOCATION path must point to the parent directory of the test file.

    Query the Regex external table:

    SELECT * FROM regex_example5a;
    
    -- The following result is returned:
    +------------+------+-------+-----+
    | date       | year | month | day |
    +------------+------+-------+-----+
    | 2022-10-12 | 2022 | 10    | 12  |
    | 2023-01-10 | 2023 | 01    | 10  |
    | 2024-09-07 | 2024 | 09    | 07  |
    | 2025-06-29 | 2025 | 06    | 29  |
    +------------+------+-------+-----+
  3. Use (\\d{4}-\\d{2}-\\d{2}) to match the date column only.

    CREATE EXTERNAL TABLE regex_example5b (
      date STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'mcfed.input.regex'='(\\d{4}-\\d{2}-\\d{2})',  
        'mcfed.columns.comments'='a'
    )
    STORED AS TEXTFILE
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test5/'; -- The LOCATION path must point to the parent directory of the test file.

    Query the Regex external table:

    SELECT * FROM regex_example5b;
    
    -- The following result is returned:
    +------------+
    | date       |
    +------------+
    | 2022-10-12 |
    | 2023-01-10 |
    | 2024-09-07 |
    | 2025-06-29 |
    +------------+
  4. Use (\\d{4})-(\\d{2})-(\\d{2}) to match three columns: year, month, and day.

    CREATE EXTERNAL TABLE regex_example5c (
      year STRING,
      month STRING,
      day STRING
    )
    ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
        'mcfed.input.regex'='(\\d{4})-(\\d{2})-(\\d{2})',  
        'mcfed.columns.comments'='a\0b\0c'
    )
    STORED AS TEXTFILE
    LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test5/'; -- The LOCATION path must point to the parent directory of the test file.

    Query the Regex external table:

    SELECT * FROM regex_example5c;
    
    -- The following result is returned:
    +------+-------+-----+
    | year | month | day |
    +------+-------+-----+
    | 2022 | 10    | 12  |
    | 2023 | 01    | 10  |
    | 2024 | 09    | 07  |
    | 2025 | 06    | 29  |
    +------+-------+-----+

Supported data types

For more information about MaxCompute data types, see Data types (Edition 1.0) and Data types (Edition 2.0).

Type

Supported

Type

Supported

TINYINT

Supported

STRING

Supported

SMALLINT

Supported

DATE

Supported

INT

Supported

DATETIME

Supported

BIGINT

Supported

TIMESTAMP

Not supported

BINARY

Not supported

TIMESTAMP_NTZ

Supported

FLOAT

Supported

BOOLEAN

Supported

DOUBLE

Supported

ARRAY

Not supported

DECIMAL(precision,scale)

Supported

MAP

Not supported

VARCHAR(n)

Supported

STRUCT

Not supported

CHAR(n)

Supported

JSON

Not supported