This topic shows you how to create a Regex external table to parse unstructured text data in Object Storage Service (OSS) with regular expressions. It also lists the supported data types.
Limitations
OSS external tables do not support the cluster property.
The size of a single file cannot exceed 2 GB. You must split files that are larger than 2 GB.
MaxCompute and OSS must be in the same region.
Write operations are not supported.
Prerequisites
Authorization
You have the CreateTable permission in your MaxCompute project. For more information about table permissions, see MaxCompute permissions.
OSS access permissions: An Alibaba Cloud account (primary account), a RAM user, or a RAM Role can access an OSS external table. For authorization details, see STS authorization for OSS.
You have created a MaxCompute project.
MaxCompute is deployed only in specific regions. To avoid potential data connectivity issues, we recommend that your OSS bucket and MaxCompute project are in the same region.
(Optional) You have prepared an OSS bucket, a directory, and data files. For more information, see Create a bucket, Manage directories, and Simple upload.
You can run a single SQL statement that contains both an external table and a User-Defined Function (UDF) to read from or write to the table. Manual directory creation is still supported.
Create an external table
Syntax
For the detailed syntax of external tables in various formats, see OSS external tables.
CREATE EXTERNAL TABLE [IF NOT EXISTS] <mc_oss_extable_name>
(
<col_name> <data_type>,
...
)
[COMMENT <table_comment>]
[PARTITIONED BY (<col_name> <data_type>, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
'mcfed.input.regex'='<your_regex>',
'mcfed.columns.comments'=''
)
STORED AS TEXTFILE
LOCATION '<oss_location>'
;Common parameters
For more information, see Basic syntax parameter description.
SERDEPROPERTIES parameters
Parameter | Required | Description |
mcfed.input.regex | Yes | Specifies the regular expression used to parse text data. Note the following:
|
mcfed.columns.comments | Yes | You must add field comments when parsing text data with a regular expression. Internal validation rule: The number of comments specified for |
Query and analyze data
For more information about the SELECT syntax, see Read data from OSS.
To optimize your query plan, see Query optimization.
Examples
Example 1: Match data starting with 123
Prepare the test file regex_test1.txt and upload it to an OSS directory.
Create a Regex external table to read only data that starts with
123. Other data is returned as NULL.CREATE EXTERNAL TABLE regex_example1 ( id1 DOUBLE ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'mcfed.input.regex'='(123.*)', 'mcfed.columns.comments'='a') STORED AS TEXTFILE LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test1/'; -- The LOCATION path must point to the parent directory of the test file.Query the Regex external table.
SELECT * FROM regex_example1; -- The following result is returned: +------------+ | id1 | +------------+ | 123.0 | | NULL | | 123.24 | +------------+
Example 2: Match fields starting with 123
Prepare the test file regex_test2.txt and upload it to an OSS directory.
Create a Regex external table to read only data where both fields start with the number
123. Other data is returned as NULL.CREATE EXTERNAL TABLE regex_example2 ( id1 INT, id2 INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'mcfed.input.regex'='(123.*)\\t(123.*)', 'mcfed.columns.comments'='a\0b' ) STORED AS TEXTFILE LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test2/'; -- The LOCATION path must point to the parent directory of the test file.Query the Regex external table.
SELECT * FROM regex_example2; -- The following result is returned: +------+------+ | id1 | id2 | +------+------+ | 123 | 123 | | NULL | NULL | +------+------+
Example 3: Match data across fields
Prepare the test file regex_test3.txt and upload it to an OSS directory.
Create a Regex external table to read only the values from the
id,city_org, andpplfields, skipping other fields.CREATE EXTERNAL TABLE regex_example3 ( id INT, city_org STRING, ppl DOUBLE ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'mcfed.input.regex'='^(\\d+)\\t([^\\t]*)\\t\\S+\\t\\S+\\t(\\d++.\\d++).*', 'mcfed.columns.comments'='a\0b\0c' ) STORED AS TEXTFILE LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test3/'; -- The LOCATION path must point to the parent directory of the test file.Query the Regex external table.
SELECT * FROM regex_example3 WHERE id>100 AND id<121; -- The following result is returned: +------+----------+------------+ | id | city_org | ppl | +------+----------+------------+ | 110 | La Coruña | 0.37 | | 112 | Cádiz | 0.4 | | 120 | Köln | 0.97 | +------+----------+------------+
Example 4: Match numeric-only data
Prepare the test file regex_test4.txt and upload it to an OSS directory.
Create a Regex external table to read only records where
id1is a three-digit number,id2is any string that starts with '123', andid3is a two-digit number.CREATE EXTERNAL TABLE regex_example4 ( id1 INT, id2 INT, id3 INT ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'mcfed.input.regex'='(\\d{3}) (123.*) (\\d{2})', 'mcfed.columns.comments'='a\0b\0c' ) STORED AS TEXTFILE LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test4/'; -- The LOCATION path must point to the parent directory of the test file.Query the Regex external table.
SELECT * FROM regex_example4; -- The following result is returned: +------+------+------+ | id1 | id2 | id3 | +------+------+------+ | 123 | NULL | 12 | +------+------+------+
Example 5: Match dates with nested groups
Prepare the test file regex_test5.txt and upload it to an OSS directory.
Use the nested group
((\\d{4})-(\\d{2})-(\\d{2}))to match four columns:date,year,month, andday.CREATE EXTERNAL TABLE regex_example5a ( date STRING, year STRING, month STRING, day STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'mcfed.input.regex'='((\\d{4})-(\\d{2})-(\\d{2}))', 'mcfed.columns.comments'='a\0b\0c\0d' ) STORED AS TEXTFILE LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test5/'; -- The LOCATION path must point to the parent directory of the test file.Query the Regex external table:
SELECT * FROM regex_example5a; -- The following result is returned: +------------+------+-------+-----+ | date | year | month | day | +------------+------+-------+-----+ | 2022-10-12 | 2022 | 10 | 12 | | 2023-01-10 | 2023 | 01 | 10 | | 2024-09-07 | 2024 | 09 | 07 | | 2025-06-29 | 2025 | 06 | 29 | +------------+------+-------+-----+Use
(\\d{4}-\\d{2}-\\d{2})to match thedatecolumn only.CREATE EXTERNAL TABLE regex_example5b ( date STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'mcfed.input.regex'='(\\d{4}-\\d{2}-\\d{2})', 'mcfed.columns.comments'='a' ) STORED AS TEXTFILE LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test5/'; -- The LOCATION path must point to the parent directory of the test file.Query the Regex external table:
SELECT * FROM regex_example5b; -- The following result is returned: +------------+ | date | +------------+ | 2022-10-12 | | 2023-01-10 | | 2024-09-07 | | 2025-06-29 | +------------+Use
(\\d{4})-(\\d{2})-(\\d{2})to match three columns:year,month, andday.CREATE EXTERNAL TABLE regex_example5c ( year STRING, month STRING, day STRING ) ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe' WITH SERDEPROPERTIES ( 'mcfed.input.regex'='(\\d{4})-(\\d{2})-(\\d{2})', 'mcfed.columns.comments'='a\0b\0c' ) STORED AS TEXTFILE LOCATION 'oss://oss-cn-hangzhou-internal.aliyuncs.com/<bucket>/regex_test5/'; -- The LOCATION path must point to the parent directory of the test file.Query the Regex external table:
SELECT * FROM regex_example5c; -- The following result is returned: +------+-------+-----+ | year | month | day | +------+-------+-----+ | 2022 | 10 | 12 | | 2023 | 01 | 10 | | 2024 | 09 | 07 | | 2025 | 06 | 29 | +------+-------+-----+
Supported data types
For more information about MaxCompute data types, see Data types (Edition 1.0) and Data types (Edition 2.0).
Type | Supported | Type | Supported |
TINYINT | STRING | ||
SMALLINT | DATE | ||
INT | DATETIME | ||
BIGINT | TIMESTAMP | ||
BINARY | TIMESTAMP_NTZ | ||
FLOAT | BOOLEAN | ||
DOUBLE | ARRAY | ||
DECIMAL(precision,scale) | MAP | ||
VARCHAR(n) | STRUCT | ||
CHAR(n) | JSON |