Regex external table-MaxCompute(MaxCompute)-阿里云帮助中心

This topic shows you how to create a Regex external table to parse unstructured text data in Object Storage Service (OSS) with regular expressions. It also lists the supported data types.

Limitations

OSS external tables do not support the cluster property.
The size of a single file cannot exceed 2 GB. You must split files that are larger than 2 GB.
MaxCompute and OSS must be in the same region.
Write operations are not supported.
Supported data types.

Prerequisites

Authorization
- You have the CreateTable permission in your MaxCompute project. For more information about table permissions, see MaxCompute permissions.
- OSS access permissions: An Alibaba Cloud account (primary account), a RAM user, or a RAM Role can access an OSS external table. For authorization details, see STS authorization for OSS.
You have created a MaxCompute project.
MaxCompute is deployed only in specific regions. To avoid potential data connectivity issues, we recommend that your OSS bucket and MaxCompute project are in the same region.
(Optional) You have prepared an OSS bucket, a directory, and data files. For more information, see Create a bucket, Manage directories, and Simple upload.
You can run a single SQL statement that contains both an external table and a User-Defined Function (UDF) to read from or write to the table. Manual directory creation is still supported.

Create an external table

Syntax

For the detailed syntax of external tables in various formats, see OSS external tables.

CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
  <col_name> <data_type>,
  ...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'mcfed.input.regex'='<your_regex>',
    'mcfed.columns.comments'=''
)
STORED AS TEXTFILE
LOCATION '<oss_location>'
;

Common parameters

For more information, see Basic syntax parameter description.

SERDEPROPERTIES parameters

Parameter

Required

Description

mcfed.input.regex

Yes

Specifies the regular expression used to parse text data. Note the following:

Regex patterns such as \d must be escaped as \\d.
A Regex external table maps capture groups to columns. The pattern for each field must be enclosed in parentheses (). For example, use (123.*)\\t(123.*) to map two fields separated by a tab character (\t).

mcfed.columns.comments

Yes

You must add field comments when parsing text data with a regular expression.

Internal validation rule:

The number of comments specified for columns.comments must match the number of columns. For example, for four columns, you must provide four comment segments in the format a\0b\0c\0d.

Query and analyze data

For more information about the SELECT syntax, see Read data from OSS.
To optimize your query plan, see Query optimization.

Examples

Example 1: Match data starting with 123

Prepare the test file regex_test1.txt and upload it to an OSS directory.

Create a Regex external table to read only data that starts with 123. Other data is returned as NULL.

CREATE EXTERNAL TABLE regex_example1 (
  id1 DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'mcfed.input.regex'='(123.*)',
    'mcfed.columns.comments'='a')
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test1/'; -- The LOCATION path must point to the parent directory of the test file.

Query the Regex external table.

SELECT * FROM regex_example1;

-- The following result is returned:
+------------+
| id1        |
+------------+
| 123.0      |
| NULL       |
| 123.24     |
+------------+

Example 2: Match fields starting with 123

Prepare the test file regex_test2.txt and upload it to an OSS directory.

Create a Regex external table to read only data where both fields start with the number 123. Other data is returned as NULL.

CREATE EXTERNAL TABLE regex_example2 (
  id1 INT,
  id2 INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'mcfed.input.regex'='(123.*)\\t(123.*)',
    'mcfed.columns.comments'='a\0b'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test2/'; -- The LOCATION path must point to the parent directory of the test file.

Query the Regex external table.

SELECT * FROM regex_example2;

-- The following result is returned:
+------+------+
| id1  | id2  |
+------+------+
| 123  | 123  |
| NULL | NULL |
+------+------+

Example 3: Match data across fields

Prepare the test file regex_test3.txt and upload it to an OSS directory.

Create a Regex external table to read only the values from the id, city_org, and ppl fields, skipping other fields.

CREATE EXTERNAL TABLE regex_example3 (
  id INT,
  city_org STRING,
  ppl DOUBLE
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'mcfed.input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*', 
    'mcfed.columns.comments'='a\0b\0c'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test3/'; -- The LOCATION path must point to the parent directory of the test file.

Query the Regex external table.

SELECT * FROM regex_example3 WHERE id>100 AND id<121; 

-- The following result is returned:
+------+----------+------------+
| id   | city_org | ppl        |
+------+----------+------------+
| 110  | La Coruña | 0.37       |
| 112  | Cádiz     | 0.4        |
| 120  | Köln      | 0.97       |
+------+----------+------------+

Example 4: Match numeric-only data

Prepare the test file regex_test4.txt and upload it to an OSS directory.

Create a Regex external table to read only records where id1 is a three-digit number, id2 is any string that starts with '123', and id3 is a two-digit number.

CREATE EXTERNAL TABLE regex_example4 (
  id1 INT,
  id2 INT,
  id3 INT
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'mcfed.input.regex'='(\\d{3}) (123.*) (\\d{2})', 
    'mcfed.columns.comments'='a\0b\0c'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test4/'; -- The LOCATION path must point to the parent directory of the test file.

Query the Regex external table.

SELECT * FROM regex_example4;

-- The following result is returned:
+------+------+------+
| id1  | id2  | id3  |
+------+------+------+
| 123  | NULL | 12   |
+------+------+------+

Example 5: Match dates with nested groups

Prepare the test file regex_test5.txt and upload it to an OSS directory.

Use the nested group ((\\d{4})-(\\d{2})-(\\d{2})) to match four columns: date, year, month, and day.

CREATE EXTERNAL TABLE regex_example5a (
  date STRING,
  year STRING,
  month STRING,
  day STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'mcfed.input.regex'='((\\d{4})-(\\d{2})-(\\d{2}))',  
    'mcfed.columns.comments'='a\0b\0c\0d'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test5/'; -- The LOCATION path must point to the parent directory of the test file.

Query the Regex external table:

SELECT * FROM regex_example5a;

-- The following result is returned:
+------------+------+-------+-----+
| date       | year | month | day |
+------------+------+-------+-----+
| 2022-10-12 | 2022 | 10    | 12  |
| 2023-01-10 | 2023 | 01    | 10  |
| 2024-09-07 | 2024 | 09    | 07  |
| 2025-06-29 | 2025 | 06    | 29  |
+------------+------+-------+-----+

Use (\\d{4}-\\d{2}-\\d{2}) to match the date column only.

CREATE EXTERNAL TABLE regex_example5b (
  date STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'mcfed.input.regex'='(\\d{4}-\\d{2}-\\d{2})',  
    'mcfed.columns.comments'='a'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test5/'; -- The LOCATION path must point to the parent directory of the test file.

Query the Regex external table:

SELECT * FROM regex_example5b;

-- The following result is returned:
+------------+
| date       |
+------------+
| 2022-10-12 |
| 2023-01-10 |
| 2024-09-07 |
| 2025-06-29 |
+------------+

Use (\\d{4})-(\\d{2})-(\\d{2}) to match three columns: year, month, and day.

CREATE EXTERNAL TABLE regex_example5c (
  year STRING,
  month STRING,
  day STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
    'mcfed.input.regex'='(\\d{4})-(\\d{2})-(\\d{2})',  
    'mcfed.columns.comments'='a\0b\0c'
)
STORED AS TEXTFILE
LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test5/'; -- The LOCATION path must point to the parent directory of the test file.

Query the Regex external table:

SELECT * FROM regex_example5c;

-- The following result is returned:
+------+-------+-----+
| year | month | day |
+------+-------+-----+
| 2022 | 10    | 12  |
| 2023 | 01    | 10  |
| 2024 | 09    | 07  |
| 2025 | 06    | 29  |
+------+-------+-----+

Supported data types

For more information about MaxCompute data types, see Data types (Edition 1.0) and Data types (Edition 2.0).

Type	Supported	Type	Supported
TINYINT	Supported	STRING	Supported
SMALLINT	Supported	DATE	Supported
INT	Supported	DATETIME	Supported
BIGINT	Supported	TIMESTAMP	Not supported
BINARY	Not supported	TIMESTAMP_NTZ	Supported
FLOAT	Supported	BOOLEAN	Supported
DOUBLE	Supported	ARRAY	Not supported
DECIMAL(precision,scale)	Supported	MAP	Not supported
VARCHAR(n)	Supported	STRUCT	Not supported
CHAR(n)	Supported	JSON	Not supported