Access unstructured OSS data with object table

更新时间:
复制 MD 格式

Object tables map files stored in OSS into a queryable Hologres table—similar to an external table, but purpose-built for unstructured data such as images, PDFs, audio, and video. Hologres V4.0 and later support this feature.

With object tables, you can:

Prerequisites

Before you begin, ensure that you have:

  • A Hologres instance running V4.0 or later. To upgrade, see Upgrade an instance

  • An OSS bucket containing the files you want to access

  • A RAM role with read access to the OSS bucket. To find the role ARN (Alibaba Cloud Resource Name), go to the Resource Access Management (RAM) console, click Identity Management > Role, and then click the role name

Limitations

  • Only classic network endpoints are supported.

  • Only manual refresh is supported. There is no automatic refresh.

  • By default, object tables scan only the top-level directory of the specified path. Subdirectories are not included. To scan subdirectories recursively, run:

    Note

    Recursive scans consume more resources than top-level scans. Set path to the most specific directory that contains your target files.

    ALTER DATABASE <db_name> SET hg_experimental_enable_oss_meta_recursive = on;

Create an object table

Object tables have a fixed schema—you do not specify columns when creating one.

Syntax:

CREATE OBJECT TABLE [IF NOT EXISTS] [schema_name.]<table_name>
WITH (
  -- Required
  path          = '<oss_path>',
  oss_endpoint  = '<oss_endpoint>',
  role_arn      = '<role_arn>',

  -- Optional
  [orientation                  = 'column | row | row,column',]
  [table_group                  = '<table_group_name>',]
  [distribution_key             = '<column_name>[,...]',]
  [clustering_key               = '<column_name>[:asc][,...]',]
  [storage_mode                 = 'hot | cold',]
  [event_time_column            = '<column_name>[,...]',]
  [bitmap_columns               = '<column_name>[,...]',]
  [dictionary_encoding_columns  = '<column_name>[,...]',]
  [time_to_live_in_seconds      = '<non_negative_integer>']
);

After creating an object table, refresh it once to load data:

REFRESH OBJECT TABLE [schema_name.]<table_name>;

Required parameters

Optional parameters

Parameter

Description

Example

path

The OSS directory path containing your files. The object table reads metadata from files in this directory.

oss://my-bucket/my-dir

oss_endpoint

The OSS classic network endpoint for the bucket region. For endpoints by region, see .

role_arn

The ARN of a RAM role with read access to the OSS bucket.

acs:ram::role-id:role/role-name

Parameter

Default

Description

orientation

column

Storage format. Supported values: column, row, row,column. See Table storage formats.

table_group

Default table group

The table group and shard count for the object table.

distribution_key

object_uri

The distribution key.

clustering_key

object_uri

The clustering key.

storage_mode

hot

Storage mode. Supported values: hot (hot storage), cold (cold storage). See Data tiering.

event_time_column

last_modified_at

The event time column (segment key).

bitmap_columns

object_uri:auto,etag:auto

The bitmap index columns.

dictionary_encoding_columns

object_uri:auto,etag:auto

Dictionary encoding columns.

time_to_live_in_seconds

3153600000

Data lifecycle for the table in seconds.

Query an object table

After the initial refresh, query an object table like any standard table:

SELECT * FROM <object_table_name>;

Object tables have the following fixed columns:

To view the DDL or storage properties of an object table, use either of the following:

-- View DDL
SELECT hg_dump_script('[schema_name.]<object_table_name>');

-- View storage properties
SELECT * FROM hologres.hg_table_properties WHERE table_name = '<object_table_name>';

Refresh an object table

Refresh an object table to sync file metadata from OSS. Only manual refresh is supported.

REFRESH OBJECT TABLE [schema_name.]<table_name>;
Note

Refreshing consumes compute resources. To minimize cost, set path to the most specific directory that contains your target files.

Manage an object table

Rename or move an object table

-- Rename
ALTER OBJECT TABLE [IF EXISTS] <table_name> RENAME TO <new_name>;

-- Move to a different schema
ALTER OBJECT TABLE [IF EXISTS] [schema_name.]<table_name> SET SCHEMA <new_schema>;

Delete an object table

-- Moves to the recycle bin if it is enabled; otherwise, permanently deletes
DROP OBJECT TABLE [IF EXISTS] <table_name>;

-- Permanently deletes without moving to the recycle bin
DROP OBJECT TABLE [IF EXISTS] <table_name> FORCE;

If the database recycle bin is enabled, dropping an object table moves it to the recycle bin automatically. The object table can be recovered using the recycle bin recovery command, and it remains an object table after recovery. For details, see Recycle bin.

Warning

The FORCE option permanently deletes the object table. Recovery is not possible.

Usage example

This example shows how to use an Object Table to parse PDF documents. Follow these steps.

Preparations

  • Prepare data

    1. Download 80 prospectuses (PDF) from the Bojin Large Language Model Challenge—Financial Qwen-14B Dataset.

    2. Log on to the OSS console. . Upload the downloaded PDF files to the bucket path. For upload details, see .

  • Prepare accounts

    1. Log on to the Resource Access Management (RAM) console. Create an Alibaba Cloud RAM role and grant it OSS permissions.

      We recommend granting the AliyunOSSReadOnlyAccess permission.

    2. Grant login and Hologres access permissions to the RAM role.

      • Alibaba Cloud account (root account)

        Update the trust policy for the RAM role. Update these parameters:

        • Action: Set to sts:AssumeRole.

        • Service: Set to hologres.aliyuncs.com.

        {
          "Statement": [
            {
              "Action": "sts:AssumeRole",
              "Effect": "Allow",
              "Principal": {
                "RAM": [
                  "acs:ram::1866xxxx:root"
                ],
                "Service": [
                  "hologres.aliyuncs.com"
                ]
              }
            }
          ],
          "Version": "1"
        }
      • RAM user (sub-account)

        1. Grant permissions to the RAM user.

          1. On the Access Control page, click Create Policy, and select the Script Editor mode to create a policy. For more information, see Create a custom policy.

            Hologres uses this policy to check whether the RAM user has permission to create the RAM role. Policy content:

            {
              "Version": "1",
              "Statement": [
                {
                  "Effect": "Allow",
                  "Action": "hologram:GrantAssumeRole",
                  "Resource": "<RoleARN>"
                }
              ]
            }
          2. On the Identity Management > User page, click Add Permissions in the Actions

        2. Grant permissions to the RAM role.

          Update the trust policy for the RAM role. Update these parameters:

          • Action: Set to sts:AssumeRole.

          • Service: Set to hologres.aliyuncs.com.

          {
            "Statement": [
              {
                "Action": "sts:AssumeRole",
                "Effect": "Allow",
                "Principal": {
                  "RAM": [
                    "acs:ram::1866xxxx:root"
                  ],
                  "Service": [
                    "hologres.aliyuncs.com"
                  ]
                }
              }
            ],
            "Version": "1"
          }
  • Prepare a Hologres instance

    1. Purchase a Hologres instance.

      Ensure the Hologres instance version is V4.0 or later.

    2. .

Procedure

  1. Connect to HoloWeb and run queries. Create an Object Table to link to the PDF files in OSS.

    CREATE OBJECT TABLE public.pdf_bs_challenge_financial_14b_dataset
    WITH (
      path = 'oss://xxxx-hangzhou/bs_challenge_financial_14b_dataset/pdf',                           -- Required
      oss_endpoint = 'oss-cn-hangzhou-internal.aliyuncs.com',        
      role_arn = 'acs:ram::xxx:role/xxxx'                        
    );
  2. Refresh and query the Object Table.

    You must manually refresh the Object Table once after creation to query data. Run these commands:

    -- Refresh
    REFRESH OBJECT TABLE public.pdf_bs_challenge_financial_14b_dataset;
    -- Query
    SELECT object_uri  ,etag, file->>'size' AS size, file->>'last_modified_at' as last_modified_at FROM public.pdf_bs_challenge_financial_14b_dataset;

    The query returns:

      object_uri        |      etag       |   size   |    last_modified_at    
    ---------------------------------------------------------------------------------
     oss://hm-****-hangzhou/bs_challenge_financial_14b_dataset/pdf/5c2707b186a798118bba251cce8d1fc331b****.PDF | 836F504F63F4B77A2271CD3285DDECF7 | 4170245  | 2025-08-18 14:18:19+08
     oss://hm-****-hangzhou/bs_challenge_financial_14b_dataset/pdf/72103a0ef6299585cdd09751682534d873f****.PDF | 39DA597D83BBCA571AC8B94CC464CDC8 | 4022782  | 2025-08-18 14:17:05+08
  3. Search unstructured data.

    An Object Table maps PDF metadata fields one-to-one. You can combine it with Hologres AI Function to search and infer PDF content. Follow these steps:

    1. .

      This example uses AI resources with the specification large-96core-512G-384G and one node.

    2. Deploy AI models.

      After purchasing AI resources, go to the Hologres console to deploy models. For PDF search, deploy the following models and resources:

Column

Type

Description

object_uri

TEXT

The exact OSS file path.

etag

TEXT

A unique identifier for the file content, assigned when the object is created.

file

FILE

A JSON-like structure containing detailed file metadata. Parse it using JSON and JSONB types. Fields include: object_uri, etag, size, last_modified_at, owner_name, object_source (always OSS), oss_endpoint, object_table_id, and role_arn.

metadata

JSON

Custom metadata attached to the file.

Create a table to store embedded PDF data.

Create an internal Hologres table to store embedded PDF data. Build a vector index on this table for vector search.

-- Create embedding table
CREATE  TABLE public.embed_pdf_bs (
    object_uri text NOT NULL,
    etag text NOT NULL,
    chunk text NOT NULL,
    embedding_vector float4[] CHECK(array_ndims(embedding_vector) = 1 AND array_length(embedding_vector, 1) =1024)
) WITH (
bitmap_columns = 'object_uri,etag,chunk',
clustering_key = 'object_uri,etag',
distribution_key = 'object_uri',
vectors = '{
      "embedding_vector": {
        "algorithm": "HGraph",
        "distance_method": "Cosine",
        "builder_params": {
        "base_quantization_type": "sq8_uniform",
        "max_degree": 64,
        "ef_construction": 400,
        "precise_quantization_type": "fp32",
        "use_reorder": true
        }
      }}'
      );

Call AI Function to embed and chunk PDFs, then write results to the table.

-- Convert PDFs to documents and embed
insert into public.embed_pdf_bs 
  WITH parsed_doc AS (
    SELECT object_uri, etag, ai_parse_document(file, 'auto', 'markdown') AS doc 
    FROM pdf_bs_challenge_financial_14b_dataset limit 1), 
  chunked_doc AS (
    SELECT object_uri, etag, unnest(ai_chunk(doc, 300, 50)) AS chunk 
    FROM parsed_doc )
  SELECT object_uri, etag, chunk, ai_embed(chunk) AS embedding_vector 
  FROM chunked_doc
  WHERE chunk IS NOT NULL AND length(chunk) > 0;

Search and infer PDF content.

You can assess whether a company’s future outlook is optimistic or pessimistic based on financial trends in prospectuses. This helps guide investment decisions. Example SQL:

WITH
  -- Embed question and retrieve
  embedding_recall AS (
    SELECT chunk FROM embed_pdf_bs
    ORDER BY
    approx_cosine_distance(
      embedding_vector,
      ai_embed('In the reporting period, what were the year-on-year growth rates for revenue and net profit of Hunan Goke Microelectronics Co., Ltd. in 2014, 2015, and 2016?')
    ) DESC
    LIMIT 20
  ),
  -- Concatenate top chunks
  concat_top_chunks AS (
    SELECT string_agg(chunk, E'\n\n----\n\n') AS merged_chunks FROM embedding_recall
  )
-- Generate final answer
SELECT
  ai_gen('Analyze whether the following financial trend is optimistic or pessimistic, and explain why: ' ||
         ai_gen('In the reporting period, what were the year-on-year growth rates for revenue and net profit of Hunan Goke Microelectronics Co., Ltd. in 2014, 2015, and 2016? '  || merged_chunks))
FROM concat_top_chunks;

The query returns:

ai_gen
----
"From the provided financial data and growth analysis, Hunan Goke Microelectronics Co., Ltd. showed an overall **optimistic** performance trend from 2014 to 2016, for the following reasons:

### 1. **Revenue grew consistently**
- Revenue increased by 112.94% in 2014, 103.00% in 2015, and 33.22% in 2016. This indicates steady business expansion and rising market demand.
- Growth was driven by mass production and sales of new products (e.g., smart surveillance chips and GK-series chips) and partnerships with well-known distributors, highlighting strengths in product innovation and market outreach.

### 2. **Net profit rose significantly**
- Net profit surged by 467.53% in 2014, 8.21% in 2015, and 33.07% in 2016. Although growth slowed in 2015, it remained positive.
- In 2016, higher-margin products (e.g., solid-state storage chips) contributed more to revenue, while sales expenses declined. This reflects successful optimization of product mix and cost control.

### 3. **Product mix improved and innovation advanced**
- In 2014, the company benefited from the opening of the Live Satellite market. In 2015, smart surveillance chips entered mass production. In 2016, GK-series chips were adopted by major distributors. This shows the company’s ability to align with market needs and drive growth through innovation.
- Higher-margin products made up a larger share of sales, boosting overall profitability.

### 4. **Market reach expanded**
- Partnerships with leading distributors (e.g., Koton and Shijie) indicate breakthroughs in channel development, helping expand market share and brand influence.

### 5. **Overall trend is positive**
- Although revenue growth slowed in 2016 compared to 2015, net profit continued to rise at a healthy pace. This signals improved operational efficiency and profitability. It shows the company adapts well to market changes and optimizes effectively.

### Conclusion:
Overall, Hunan Goke Microelectronics Co., Ltd. showed an **optimistic** performance trend from 2014 to 2016. The company achieved consistent growth in both revenue and net profit. It strengthened its core competitiveness and profitability through product mix optimization, innovation, and market expansion. While revenue growth slowed in 2016, stable net profit growth reflects strong operational management and cost control."

Function name

Model name

Model Function Description

CPU per replica

Memory per replica

GPU per replica

Number of replicas

to_doc

ds4sd/docling-models

Convert PDFs to documents.

20 core

100 G

1 GPU (48 G)

1

chunk

recursive-character-text-splitter

Split documents. Split large PDFs.

15 core

30 G

0 GPU (0 G)

1

pdf_embed

BAAI/bge-base-zh-v1.5

Document Embedding.

7 core

30 G

1 GPU (96 G)

1

llm

Qwen/Qwen3-32B

Use a Large Language Model (LLM) to infer retrieved document content based on prompts.

7 core

30 G

1 GPU (96 G)

1

Best practices

Object tables work with Dynamic Tables, AI Functions, vector search, and full-text search to automate processing, searching, and analysis of unstructured data. For details, see: