MetricStore HTTP API return values

Simple Log Service (SLS) provides several APIs to query time-series metrics or write metric data to MetricStore. These APIs are compatible with the open source Prometheus protocol. This topic describes the return values of these APIs.

Response structure

The response structure of MetricStore Prometheus write and query APIs follows the official Prometheus definition. SLS adds the slsStatus field to provide more detailed error messages and handling policies.

The response structure of the Prometheus API is as follows:

HTTP Header :
x-sls-request-id : <string>  // The RequestId of this request

HTTP Response Body :
{
  "status": "success" | "error",
  "data": <data>,

  // Internal error message returned by SLS. This field exists only when a warning or error occurs during a query.
  "slsStatus" : {
     // Retry policy
     "retryPolicy" : "None" | "Once" | "Continuous",
     // Error code
     "errorCode" : "<string>",
     // Error message
     "errorMessages" : ["<string>"]
  },

  // The following two items are returned when an error occurs during query analysis.
  "errorType": "<string>",
  "error": "<string>",
  
  // Returns a warning message, which usually indicates an incomplete query.
  "warnings": ["<string>"],
  // Returns general information that usually does not require special handling.
  "infos": ["<string>"]
}
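As a rough illustration, the structure above could be mapped onto typed objects on the client side. This is a minimal sketch, not an official SDK: the class names `SlsStatus` and `PromResponse` and the function `parse_response` are hypothetical, and defaulting a missing `retryPolicy` to "None" is an assumption.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class SlsStatus:
    retry_policy: str  # "None", "Once", or "Continuous"
    error_code: str
    error_messages: List[str] = field(default_factory=list)


@dataclass
class PromResponse:
    status: str  # "success" or "error"
    data: Any = None
    sls_status: Optional[SlsStatus] = None  # present only on warnings or errors
    error_type: Optional[str] = None
    error: Optional[str] = None
    warnings: List[str] = field(default_factory=list)
    infos: List[str] = field(default_factory=list)


def parse_response(body: str) -> PromResponse:
    """Parse a response body into a PromResponse (hypothetical helper)."""
    raw = json.loads(body)
    sls = raw.get("slsStatus")
    sls_status = None
    if sls is not None:
        sls_status = SlsStatus(
            # Assumption: treat a missing retryPolicy as "None" (do not retry).
            retry_policy=sls.get("retryPolicy", "None"),
            error_code=sls.get("errorCode", ""),
            error_messages=list(sls.get("errorMessages", [])),
        )
    return PromResponse(
        status=raw["status"],
        data=raw.get("data"),
        sls_status=sls_status,
        error_type=raw.get("errorType"),
        error=raw.get("error"),
        warnings=list(raw.get("warnings", [])),
        infos=list(raw.get("infos", [])),
    )
```

A caller would typically check `sls_status` first: when it is `None` and the HTTP status code is 200, the result is complete.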

Retry policy

The retry policy guides the client's retry behavior when an error occurs. This prevents ineffective retries that can degrade the user experience.

| Retry policy | Description | Recommended action |
| --- | --- | --- |
| Once | The error may be caused by transient factors such as throttling. A retry will most likely return the same error, but there is a small chance of recovery. | Wait for a period of time (more than 300 ms is recommended) and then retry once. If the same retry policy (Once) is returned again, stop retrying. |
| Continuous | A server-side issue exists. The service may recover, so retry continuously at intervals. | Retry with an annealing (exponential backoff) strategy. The recommended method is: 1. Wait 300 ms before the first retry. 2. If the Continuous policy is returned again, double the wait time, capped at 10 s. 3. If a policy other than Continuous is returned, follow the handling logic for that policy. |
| None | Sending the same request again is certain to produce the same error. | Do not retry. |
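The wait times recommended for the Continuous policy can be sketched as a small generator. This is only an illustration: the function name `continuous_retry_delays` and its parameters are hypothetical, while the 300 ms starting value, the doubling, and the 10 s cap come from the recommendation above.

```python
from typing import Iterator


def continuous_retry_delays(initial_s: float = 0.3, cap_s: float = 10.0) -> Iterator[float]:
    """Yield wait times in seconds: start at initial_s, double each time, cap at cap_s."""
    delay = min(initial_s, cap_s)
    while True:
        yield delay
        delay = min(delay * 2, cap_s)
```

With the defaults, the first waits are 0.3 s, 0.6 s, 1.2 s, 2.4 s, 4.8 s, and 9.6 s, after which every wait is 10 s. The caller stops consuming the generator as soon as a response carries a policy other than Continuous.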

Status codes and recommended actions

| Request type | HTTP status code | ErrorCode | Description | Retry policy | Recommended action |
| --- | --- | --- | --- | --- | --- |
| Query & Write | 200 | None | The request was successful. | None | None |
| Query | 200 | ShardPartialSuccess | Execution failed on some shards. This is usually caused by a full queue or a breakdown. | Continuous | Retry using an annealing strategy. For more information, see the Continuous retry policy. |
| Query | 200 | ShardResourceExceed | The requested data volume exceeds the shard read limit. For more information, see MetricStore. | Once | Retry once. The error is likely to occur again on the next request. Split the shard or use the parallel computing mode. |
| Query | 200 | EngineResourceExceed | This error is usually returned because the compute engine detects that the data volume far exceeds the limit before execution. For more information, see MetricStore. | None | Narrow the time range or optimize the query to reduce the amount of data scanned. You can also use the parallel computing mode. |
| Query | 200 | BadDataWarning | A calculation error caused by a data issue. This is common in scenarios where data does not match for multi-element operators in Prometheus Query Language (PromQL). In this case, only partial calculation results are retained. | None | Check if the data and query meet your business requirements. Optimize the data or query. |
| Query & Write | 400 | BadParameterError | Query parameter or syntax error. | None | View the error message and correct the statement or parameter. |
| Query & Write | 422 | BadDataError | A calculation or write error caused by a data issue. | None | Check if the data and query meet your business requirements. Optimize the data or query. |
| Query | 422 | EngineExecutionExceed | The volume of data involved in the calculation in the compute engine exceeds the limit. For more information, see MetricStore. | None | Narrow the time range or optimize the query to reduce the amount of data scanned. You can also use the parallel computing mode. |
| Query & Write | 401 | Unauthorized | Permission denied for the operation. | None | For more information, see Overview and grant permissions to the account. |
| Query & Write | 404 | ProjectNotExist | The project does not exist. | None | If the project has not been deleted, check if the project name and endpoint are correct. |
| Query & Write | 404 | MetricStoreNotExist | The MetricStore does not exist. | None | If the MetricStore has not been deleted, check if the MetricStore name is correct. |
| Query | 503 | EngineQueueTimeout | The engine queue timed out. This is usually caused by high engine load. | Continuous | Retry using an annealing strategy. If the error persists, submit a ticket to contact technical support. |
| Query | 500 | EngineExecutionError | An unknown error occurred during engine execution. This is usually caused by an invalid data format. | Once | Retry once. If the error persists, submit a ticket. |
| Query | 502 | EngineExecutionTimeout | The engine timed out while executing the query. This is usually caused by high engine load or high query complexity. | Once | Retry once. If the error persists, submit a ticket. |
| Write | 500 | WriteQuotaExceed | A write error caused by exceeding the project or shard quota. For more information, see Data read and write operations. Note: When a quota is exceeded, the SLS OpenAPI returns the HTTP status code 403. However, because the official Prometheus protocol does not define a return type for quota exceeded errors, the status code is set to 500 to ensure that open source agents can retry properly. | Continuous | Use an annealing strategy for write retries. Also, check the error message. If the project quota is exceeded, submit a ticket to contact technical support to adjust the quota. If the store quota is exceeded, enable automatic shard splitting for the store or split the shards manually. |
| Query & Write | 500 | InternalServerError | Other internal system errors. | Continuous | Retry using an annealing strategy. If the error persists, submit a ticket. |

Response examples

Successful request

{
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {},
                "values": [
                    [
                        1673798460,
                        "11111111"
                    ],
                    [
                        1673799060,
                        "22222222"
                    ],
                    [
                        1673799660,
                        "33333333"
                    ]
                ]
            }
        ]
    }
}

Partial data returned

When a query is incomplete, partial data is returned. In this case, the slsStatus field contains a detailed error message and a retry policy.

{
	"status": "success",
	"slsStatus": {
		"retryPolicy": "Once",
		"errorCode": "ShardResourceExceed",
		"errorMessages": ["sls response error 1 task data", "sls response drop 1 task data"]
	},
	"data": {
		"resultType": "vector",
		"result": [{
			"metric": {
				"__name__": "up"
			},
			"value": [
				1747807789.37,
				"1"
			]
		}]
	},
	"warnings": ["sls response error 1 task data", "sls response drop 1 task data"]
}

Returned errors

Example 1

{
	"status": "error",
	"slsStatus": {
		"retryPolicy": "None",
		"errorCode": "EngineResourceExceed",
		"errorMessages": ["too many time Series or items"]
	},
	"data": {
	},
	"error": "too many time Series or items"
}

Example 2

{
	"status": "error",
	"slsStatus": {
		"retryPolicy": "None",
		"errorCode": "BadDataError",
		"errorMessages": ["vector cannot contain metrics with the same labelset"]
	},
	"data": {
	},
	"error": "vector cannot contain metrics with the same labelset"
}

Scenarios

Frontend display

MetricStore features built-in metric visualization, supports building Dashboards, and is compatible with Grafana. If a query is abnormal, the built-in visualization solution automatically displays the relevant information on the chart.

  • Use Grafana 10.0 or later for better warning messages.

If you build a custom visualization solution based on the MetricStore Prometheus API, handle exceptions based on the following policies:

  1. If the HTTP status code is 200, render the chart normally.

    1. If the slsStatus field exists, also display a warning message on the chart and provide a solution based on the retryPolicy value.

  2. If the HTTP status code is not 200, display the error message directly on the chart and provide a solution based on the retryPolicy value.
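The display rules above can be sketched as a single decision function for a custom frontend. This is a hedged illustration only: `chart_decision` and the hint texts in `HINTS` are hypothetical, and falling back to the `error` field when slsStatus is missing is an assumption.

```python
from typing import Optional, Tuple

# Hypothetical user-facing hints, paraphrased from the retry policy descriptions.
HINTS = {
    "None": "retrying will not help; adjust the query or data",
    "Once": "retry once after a short wait",
    "Continuous": "retry with increasing intervals",
}


def chart_decision(http_status: int, body: dict) -> Tuple[bool, Optional[str]]:
    """Return (render_chart, banner_text) for one parsed response body."""
    sls = body.get("slsStatus")
    if http_status == 200 and sls is None:
        return True, None  # complete result, nothing to show besides the chart
    if sls is None:
        # Assumption: without slsStatus, only the generic error text is available.
        return False, body.get("error") or f"HTTP {http_status}"
    policy = sls.get("retryPolicy", "None")
    messages = "; ".join(sls.get("errorMessages", []))
    banner = f"{sls.get('errorCode', 'Unknown')}: {messages} ({HINTS.get(policy, policy)})"
    # HTTP 200 with slsStatus: render the partial data and show the banner as a warning.
    return http_status == 200, banner
```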

Secondary data calculation scenario

MetricStore supports using Scheduled SQL to periodically calculate aggregate data. If you instead build custom secondary calculations on the MetricStore Prometheus API, we recommend that you handle exceptions based on the following policies:

  1. If the status code is 200:

    1. If the slsStatus field does not exist, store the data and proceed to the next scheduled job.

    2. If the slsStatus field exists, handle the exception based on the RetryPolicy as follows:

      1. None: Retry once. If the policy is still None, mark the task as failed and proceed to the next scheduled job.

      2. Once: Retry once. If the policy is still Once, mark the task as failed and proceed to the next scheduled job.

      3. Continuous: Retry using an annealing strategy. If the retries continue for a certain period (10 to 30 minutes is recommended), mark the task as failed and proceed to the next scheduled job.

  2. If the status code is not 200:

    1. If the slsStatus field does not exist, the error is usually caused by network issues. Continuous retries are recommended.

    2. If the slsStatus field exists, handle the exception based on the RetryPolicy as follows:

      1. None: Retry once. If the policy is still None, mark the task as failed and proceed to the next scheduled job.

      2. Once: Retry once. If the policy is still Once, mark the task as failed and proceed to the next scheduled job.

      3. Continuous: Retry using an annealing strategy. If the retries continue for a certain period (10 to 30 minutes is recommended), mark the task as failed and proceed to the next scheduled job.
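The policies listed for this scenario can be condensed into one decision function that a scheduler calls after each response. This is a sketch under stated assumptions: the name `next_step`, its parameters, and the return values "store", "retry", and "fail" are hypothetical, and the 600-second default is simply the lower end of the recommended 10 to 30 minutes. Note that the Python value `None` means that slsStatus is absent, while the string "None" is the retry policy.

```python
from typing import Optional


def next_step(
    http_status: int,
    policy: Optional[str],
    single_retry_used: bool,
    elapsed_s: float,
    continuous_budget_s: float = 600.0,
) -> str:
    """Return "store", "retry", or "fail" for one response of a scheduled job.

    policy is the slsStatus.retryPolicy string, or None when slsStatus is absent.
    single_retry_used is True once the one allowed retry has already been made.
    elapsed_s is the time spent on this job so far, in seconds.
    """
    if policy is None:
        # No slsStatus: success on 200, otherwise most likely a network issue.
        # The caller may still want to bound these retries (not covered here).
        return "store" if http_status == 200 else "retry"
    if policy in ("None", "Once"):
        # In this scenario both policies get a single retry, as listed above.
        return "fail" if single_retry_used else "retry"
    if policy == "Continuous":
        return "retry" if elapsed_s < continuous_budget_s else "fail"
    return "fail"  # Assumption: treat an unknown policy as non-retryable.
```

On "fail", the scheduler marks the task as failed and moves on to the next scheduled job. On "retry" with the Continuous policy, it waits according to the annealing strategy before sending the request again.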

Real-time alerting scenario

SLS lets you create alerts directly on MetricStore. If you instead build a custom real-time alerting system on the MetricStore Prometheus API, handle exceptions based on the following policies:

  1. If the status code is 200:

    1. If the slsStatus field does not exist, monitor the alert conditions normally.

    2. If the slsStatus field exists, handle the exception based on the RetryPolicy as follows:

      1. None: Mark the current alert execution as failed and proceed to the next scheduled execution.

      2. Once: Retry once. If the policy is still Once, mark the current alert execution as failed and proceed to the next scheduled execution.

      3. Continuous: Retry using an annealing strategy. If the retries continue for a certain period (within 1 minute is recommended), mark the current alert execution as failed and proceed to the next scheduled execution.

  2. If the status code is not 200:

    1. If the slsStatus field does not exist, the error is usually caused by network issues. Continuous retries are recommended.

    2. If the slsStatus field exists, handle the exception based on the RetryPolicy as follows:

      1. None: Mark the current alert execution as failed and proceed to the next scheduled execution.

      2. Once: Retry once. If the policy is still Once, mark the current alert execution as failed and proceed to the next scheduled execution.

      3. Continuous: Retry using an annealing strategy. If the retries continue for a certain period (within 1 minute is recommended), mark the current alert execution as failed and proceed to the next scheduled execution.

  • Recommendations:

    • The total retry time should not exceed the interval between scheduled alerts. For example, if the alert interval is 1 minute and the total retry time exceeds 1 minute, skip the current execution and proceed to the next one. Otherwise, you may miss alerts for new and more important data.

    • We recommend that you write data from failed alert executions to a separate LogStore and configure alerts as needed. For more information, see Custom analysis of alert logs.
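To make the alerting policies and the retry-time recommendation concrete, the following sketch runs one alert execution with a bounded retry budget. It is not an official client: `run_alert_query` and its parameters are hypothetical, `send` is assumed to return the HTTP status code and the parsed JSON body, and the 0.5-second wait for the Once policy is just one value above the recommended 300 ms.

```python
import time
from typing import Callable, Optional, Tuple


def run_alert_query(
    send: Callable[[], Tuple[int, dict]],
    alert_interval_s: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[dict]:
    """Return the body to evaluate, or None if this alert execution failed."""
    # Total retry time stays within 1 minute and within the alert interval.
    budget_s = min(60.0, alert_interval_s)
    start = clock()
    backoff_s = 0.3
    once_used = False
    while True:
        status, body = send()
        sls = body.get("slsStatus")
        if sls is None and status == 200:
            return body
        # Non-200 without slsStatus is handled like Continuous (likely a network issue).
        policy = "Continuous" if sls is None else sls.get("retryPolicy", "None")
        if policy == "Once" and not once_used:
            once_used = True
            wait_s = 0.5
        elif policy == "Continuous":
            wait_s = backoff_s
            backoff_s = min(backoff_s * 2, 10.0)
        else:
            return None  # "None", a repeated "Once", or an unknown policy
        if (clock() - start) + wait_s > budget_s:
            return None  # skip this execution and wait for the next one
        sleep(wait_s)
```

Passing `sleep` and `clock` as parameters is a design choice that lets the loop be tested without real waiting. When the function returns `None`, the caller can record the failed execution, for example in a separate LogStore as recommended above.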