MetricStore HTTP API return values

Simple Log Service (SLS) provides several APIs to query time-series metrics or write metric data to MetricStore. These APIs are compatible with the open source Prometheus protocol. This topic describes the return values of these APIs.

Response structure

The response structure of MetricStore Prometheus write and query APIs follows the official Prometheus definition. SLS adds the slsStatus field to provide more detailed error messages and handling policies.

The response structure of the Prometheus API is as follows:

HTTP Header :
x-sls-request-id : <string>  // The RequestId of this request

HTTP Response Body :
{
  "status": "success" | "error",
  "data": <data>,

  // Internal error message returned by SLS. This field exists only when a warning or error occurs during a query.
  "slsStatus" : {
     // Retry policy
     "retryPolicy" : "None" | "Once" | "Continuous",
     // Error code
     "errorCode" : "<string>",
     // Error message
     "errorMessages" : ["<string>"]
  },

  // The following two items are returned when an error occurs during query analysis.
  "errorType": "<string>",
  "error": "<string>",
  
  // Returns a warning message, which usually indicates an incomplete query.
  "warnings": ["<string>"],
  // Returns general information that usually does not require special handling.
  "infos": ["<string>"]
}
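The structure above can be read into typed objects on the client side. The following is a minimal Python sketch, not an official SDK: the class names `SlsStatus` and `PromResponse` and the function `parse_response` are hypothetical, and treating a missing `retryPolicy` as "None" is an assumption made here for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class SlsStatus:
    retry_policy: str                 # "None" | "Once" | "Continuous"
    error_code: str
    error_messages: List[str] = field(default_factory=list)


@dataclass
class PromResponse:
    status: str                       # "success" | "error"
    data: Any = None
    sls_status: Optional[SlsStatus] = None   # None when slsStatus is absent
    error_type: Optional[str] = None
    error: Optional[str] = None
    warnings: List[str] = field(default_factory=list)
    infos: List[str] = field(default_factory=list)


def parse_response(body: str) -> PromResponse:
    """Parse a response body into a PromResponse (hypothetical helper)."""
    raw = json.loads(body)
    sls = raw.get("slsStatus")
    sls_status = None
    if sls is not None:
        sls_status = SlsStatus(
            # Assumption: a missing retryPolicy is treated as "None".
            retry_policy=sls.get("retryPolicy", "None"),
            error_code=sls.get("errorCode", ""),
            error_messages=list(sls.get("errorMessages", [])),
        )
    return PromResponse(
        status=raw["status"],
        data=raw.get("data"),
        sls_status=sls_status,
        error_type=raw.get("errorType"),
        error=raw.get("error"),
        warnings=list(raw.get("warnings", [])),
        infos=list(raw.get("infos", [])),
    )
```

A caller can then check `resp.sls_status is not None` to detect a warning or error, independently of the `status` field.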

Retry policy

The retry policy guides the client's retry behavior when an error occurs. This prevents ineffective retries that can degrade the user experience.

Retry policy	Description	Recommended action
Once	The error may be caused by factors such as throttling. A retry is likely to return the same error, but it occasionally succeeds.	Wait for a period of time (more than 300 ms is recommended), and then retry once. If the Once policy is returned again, stop retrying.
Continuous	A server-side issue exists, and the service may recover. Retry continuously at intervals.	Retry using an annealing strategy, that is, exponential backoff: wait 300 ms before the first retry, and double the wait time each time the Continuous policy is returned again, up to a maximum of 10 s. If a policy other than Continuous is returned, follow the handling logic for that policy.
None	If the same request is sent again, this error is certain to occur again.	Do not retry.
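The three policies can be combined into one client-side retry loop. The following Python sketch is one possible reading of the table, not a reference implementation: `call_with_retry`, the `send` callback (which returns a result together with the `retryPolicy` value, or `None` when `slsStatus` is absent), the 0.5 s wait for Once, and the overall `max_wait_s` budget are all assumptions introduced for illustration.

```python
import time
from typing import Callable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")

ONCE_DELAY_S = 0.5      # assumption: any wait above 300 ms fits the table
BACKOFF_START_S = 0.3   # first Continuous wait: 300 ms
BACKOFF_CAP_S = 10.0    # Continuous waits are capped at 10 s


def backoff_delays() -> Iterator[float]:
    """Yield 0.3, 0.6, 1.2, ... seconds, doubling up to the 10 s cap."""
    delay = BACKOFF_START_S
    while True:
        yield delay
        delay = min(delay * 2, BACKOFF_CAP_S)


def call_with_retry(
    send: Callable[[], Tuple[T, Optional[str]]],
    sleep: Callable[[float], None] = time.sleep,
    max_wait_s: float = 60.0,   # assumption: total wait budget set by caller
) -> Tuple[T, Optional[str]]:
    """Call send() until it succeeds or the returned policy says to stop."""
    delays = backoff_delays()
    once_retried = False
    waited = 0.0
    while True:
        result, policy = send()
        if policy is None or policy == "None":
            return result, policy       # no slsStatus, or a retry cannot help
        if policy == "Once":
            if once_retried:
                return result, policy   # Once returned again: stop retrying
            once_retried = True
            delay = ONCE_DELAY_S
        elif policy == "Continuous":
            delay = next(delays)
        else:
            raise ValueError(f"unknown retry policy: {policy!r}")
        if waited + delay > max_wait_s:
            return result, policy       # wait budget exhausted
        sleep(delay)
        waited += delay
```

Passing `sleep` as a parameter keeps the loop testable without real waiting. A caller that receives a non-`None` policy back knows that the last attempt still carried a warning or error.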

Status codes and recommended actions

Request type	HTTP status code	ErrorCode	Description	Retry policy	Recommended action
Query & Write	200	None	The request was successful.	None	None
Query	200	ShardPartialSuccess	Execution failed on some shards. This is usually caused by a full queue or a breakdown.	Continuous	Retry using an annealing strategy. For more information, see the Continuous retry policy.
Query	200	ShardResourceExceed	The requested data volume exceeds the shard read limit. For more information, see MetricStore.	Once	Retry once. The error is likely to occur again on the next request. Split the shard or use the parallel computing mode.
Query	200	EngineResourceExceed	This error is usually returned because the compute engine detects that the data volume far exceeds the limit before execution. For more information, see MetricStore.	None	Narrow the time range or optimize the query to reduce the amount of data scanned. You can also use the parallel computing mode.
Query	200	BadDataWarning	A calculation error caused by a data issue. This is common in scenarios where data does not match for multi-element operators in Prometheus Query Language (PromQL). In this case, only partial calculation results are retained.	None	Check if the data and query meet your business requirements. Optimize the data or query.
Query & Write	400	BadParameterError	Query parameter or syntax error.	None	View the error message and correct the statement or parameter.
Query & Write	422	BadDataError	A calculation or write error caused by a data issue.	None	Check if the data and query meet your business requirements. Optimize the data or query.
Query	422	EngineExecutionExceed	The volume of data involved in the calculation in the compute engine exceeds the limit. For more information, see MetricStore.	None	Narrow the time range or optimize the query to reduce the amount of data scanned. You can also use the parallel computing mode.
Query & Write	401	Unauthorized	Permission denied for the operation.	None	For more information, see Overview and grant permissions to the account.
Query & Write	404	ProjectNotExist	The project does not exist.	None	If the project has not been deleted, check if the project name and endpoint are correct.
Query & Write	404	MetricStoreNotExist	The MetricStore does not exist.	None	If the MetricStore has not been deleted, check if the MetricStore name is correct.
Query	503	EngineQueueTimeout	The engine queue timed out. This is usually caused by high engine load.	Continuous	Retry using an annealing strategy. If the error persists, submit a ticket to contact technical support.
Query	500	EngineExecutionError	An unknown error occurred during engine execution. This is usually caused by an invalid data format.	Once	Retry once. If the error persists, submit a ticket.
Query	502	EngineExecutionTimeout	The engine timed out while executing the query. This is usually caused by high engine load or high query complexity.	Once	Retry once. If the error persists, submit a ticket.
Write	500	WriteQuotaExceed	A write error caused by exceeding the project or shard quota. For more information, see Data read and write operations. Note: When a quota is exceeded, the SLS OpenAPI returns the HTTP status code 403. However, because the official Prometheus protocol does not define a return type for quota exceeded errors, the status code is set to 500 to ensure that open source agents can retry properly.	Continuous	Use an annealing strategy for write retries. Also, check the error message. If the project quota is exceeded, submit a ticket to contact technical support to adjust the quota. If the store quota is exceeded, enable automatic shard splitting for the store or split the shards manually.
Query & Write	500	InternalServerError	Other internal system errors.	Continuous	Retry using an annealing strategy. If the error persists, submit a ticket.

Response examples

Successful request

{
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {
                "metric": {},
                "values": [
                    [
                        1673798460,
                        "11111111"
                    ],
                    [
                        1673799060,
                        "22222222"
                    ],
                    [
                        1673799660,
                        "33333333"
                    ]
                ]
            }
        ]
    }
}

Partial data returned

When a query is incomplete, partial data is returned. In this case, the slsStatus field contains a detailed error message and a retry policy.

{
	"status": "success",
	"slsStatus": {
		"retryPolicy": "Once",
		"errorCode": "ShardResourceExceed",
		"errorMessages": ["sls response error 1 task data", "sls response drop 1 task data"]
	},
	"data": {
		"resultType": "vector",
		"result": [{
			"metric": {
				"__name__": "up"
			},
			"value": [
				1747807789.37,
				"1"
			]
		}]
	},
	"warnings": ["sls response error 1 task data", "sls response drop 1 task data"]
}

Returned errors

Example 1

{
	"status": "error",
	"slsStatus": {
		"retryPolicy": "None",
		"errorCode": "EngineResourceExceed",
		"errorMessages": ["too many time Series or items"]
	},
	"data": {
	},
	"error": "too many time Series or items"
}

Example 2

{
	"status": "error",
	"slsStatus": {
		"retryPolicy": "None",
		"errorCode": "BadDataError",
		"errorMessages": ["vector cannot contain metrics with the same labelset"]
	},
	"data": {
	},
	"error": "vector cannot contain metrics with the same labelset"
}

Scenarios

Frontend display

MetricStore provides built-in metric visualization, supports dashboards, and is compatible with Grafana. If a query returns a warning or an error, the built-in visualization solution automatically displays the relevant information on the chart.

Use Grafana 10.0 or later for better warning messages.

If you build a custom visualization solution based on the MetricStore Prometheus API, handle exceptions based on the following policies:

If the HTTP status code is 200, render the chart normally.
1. If the slsStatus field exists, also display a warning message on the chart and suggest a solution based on the retryPolicy value.
If the HTTP status code is not 200, display the error message directly on the chart and suggest a solution based on the retryPolicy value.
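For a custom frontend, these rules reduce to a small decision function. The Python sketch below is illustrative only: `chart_notice`, the shape of the returned dictionary, and the hint texts in `HINTS` (which paraphrase the retry policy table) are assumptions, not part of the API.

```python
from typing import Any, Dict

# Hint texts paraphrase the retry policy table; the wording is illustrative.
HINTS = {
    "None": "Retrying will not help. Adjust the query or its parameters.",
    "Once": "Retry once after a short wait.",
    "Continuous": "Temporary server-side issue. Retry with backoff.",
}


def chart_notice(http_status: int, body: Dict[str, Any]) -> Dict[str, Any]:
    """Decide whether to render the chart and which notice to show on it."""
    sls = body.get("slsStatus") or {}
    policy = sls.get("retryPolicy")
    messages = sls.get("errorMessages") or []

    if http_status == 200 and not sls:
        # Clean result: render the chart without any notice.
        return {"render": True, "level": None, "message": None, "hint": None}

    if http_status == 200:
        # Partial or degraded result: render the chart and add a warning.
        render, level = True, "warning"
        message = "; ".join(messages)
    else:
        # Failed request: show the error instead of the chart.
        render, level = False, "error"
        message = body.get("error") or "; ".join(messages) or f"HTTP {http_status}"

    return {
        "render": render,
        "level": level,
        "message": message,
        "hint": HINTS.get(policy),  # None when no retry policy was returned
    }
```

For a non-200 response without slsStatus (for example, a network failure that a proxy reports), the function falls back to the HTTP status code as the message and returns no hint.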

Secondary data calculation scenario

MetricStore supports Scheduled SQL for periodically calculating aggregate data. If you build a custom secondary calculation solution based on the Prometheus API for MetricStore, we recommend that you handle exceptions according to the following policies:

If the status code is 200:
1. If the slsStatus field does not exist, store the data and proceed to the next scheduled job.
2. If the slsStatus field exists, handle the exception based on the retryPolicy value as follows:
  1. None: Retry once. If the policy is still None, mark the task as failed and proceed to the next scheduled job.
  2. Once: Retry once. If the policy is still Once, mark the task as failed and proceed to the next scheduled job.
  3. Continuous: Retry using an annealing strategy. If the retries continue for a certain period (10 to 30 minutes is recommended), mark the task as failed and proceed to the next scheduled job.
If the status code is not 200:
1. If the slsStatus field does not exist, the error is usually caused by network issues. Continuous retries are recommended.
2. If the slsStatus field exists, handle the exception based on the retryPolicy value as follows:
  1. None: Retry once. If the policy is still None, mark the task as failed and proceed to the next scheduled job.
  2. Once: Retry once. If the policy is still Once, mark the task as failed and proceed to the next scheduled job.
  3. Continuous: Retry using an annealing strategy. If the retries continue for a certain period (10 to 30 minutes is recommended), mark the task as failed and proceed to the next scheduled job.

We recommend that you write data from failed tasks to a separate LogStore and configure alerts as needed. For more information, see Set alerts for scheduled SQL tasks.

Real-time alerting scenario

SLS lets you create alerts directly on MetricStore. If you instead build a custom real-time alerting system based on the Prometheus API for MetricStore, we recommend that you handle exceptions according to the following policies:

If the status code is 200:
1. If the slsStatus field does not exist, monitor the alert conditions normally.
2. If the slsStatus field exists, handle the exception based on the retryPolicy value as follows:
  1. None: Mark the current alert execution as failed and proceed to the next scheduled execution.
  2. Once: Retry once. If the policy is still Once, mark the current alert execution as failed and proceed to the next scheduled execution.
  3. Continuous: Retry using an annealing strategy. If the retries continue for a certain period (within 1 minute is recommended), mark the current alert execution as failed and proceed to the next scheduled execution.
If the status code is not 200:
1. If the slsStatus field does not exist, the error is usually caused by network issues. Continuous retries are recommended.
2. If the slsStatus field exists, handle the exception based on the retryPolicy value as follows:
  1. None: Mark the current alert execution as failed and proceed to the next scheduled execution.
  2. Once: Retry once. If the policy is still Once, mark the current alert execution as failed and proceed to the next scheduled execution.
  3. Continuous: Retry using an annealing strategy. If the retries continue for a certain period (within 1 minute is recommended), mark the current alert execution as failed and proceed to the next scheduled execution.

Recommendations:
- The total retry time should not exceed the interval between scheduled alerts. For example, if the alert interval is 1 minute and the total retry time exceeds 1 minute, skip the current execution and proceed to the next one. Otherwise, you may miss alerts for new and more important data.
- We recommend that you write data from failed alert executions to a separate LogStore and configure alerts as needed. For more information, see Custom analysis of alert logs.
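The alerting rules and the retry-time recommendation can be expressed as a single decision step that a custom alert runner calls after each query attempt. This Python sketch is a hedged illustration: the names `Action`, `retry_budget_s`, and `decide` are hypothetical, and taking the smaller of 1 minute and the alert interval as the retry budget is an assumption that combines the two recommendations above.

```python
from enum import Enum
from typing import Optional


class Action(Enum):
    EVALUATE = "evaluate"   # check the alert condition on the returned data
    RETRY = "retry"         # send the same query again
    FAIL = "fail"           # mark this execution failed, wait for the next


def retry_budget_s(alert_interval_s: float) -> float:
    """Retry for at most 1 minute, and never longer than the alert interval."""
    return min(60.0, alert_interval_s)


def decide(
    http_status: int,
    policy: Optional[str],   # retryPolicy value, or None if slsStatus is absent
    once_retried: bool,      # True if a retry for the Once policy was already made
    elapsed_s: float,        # time spent on this alert execution so far
    alert_interval_s: float,
) -> Action:
    within_budget = elapsed_s < retry_budget_s(alert_interval_s)

    if policy is None:
        if http_status == 200:
            return Action.EVALUATE
        # Non-200 without slsStatus: usually a network issue, keep retrying.
        return Action.RETRY if within_budget else Action.FAIL
    if policy == "None":
        return Action.FAIL
    if policy == "Once":
        return Action.RETRY if (not once_retried and within_budget) else Action.FAIL
    if policy == "Continuous":
        return Action.RETRY if within_budget else Action.FAIL
    raise ValueError(f"unknown retry policy: {policy!r}")
```

Note the difference between the Python value `None` (no slsStatus field) and the policy string "None" (do not retry). The function treats them separately, as the rules above do.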