Simple Log Service (SLS) provides several APIs to query time-series metrics or write metric data to MetricStore. These APIs are compatible with the open source Prometheus protocol. This topic describes the return values of these APIs.
Response structure
The response structure of MetricStore Prometheus write and query APIs follows the official Prometheus definition. SLS adds the slsStatus field to provide more detailed error messages and handling policies.
The response structure of the Prometheus API is as follows:
HTTP Header :
x-sls-request-id : <string> // The RequestId of this request
HTTP Response Body :
{
"status": "success" | "error",
"data": <data>,
// Internal error message returned by SLS. This field exists only when a warning or error occurs during a query.
"slsStatus" : {
// Retry policy
"retryPolicy" : "None" | "Once" | "Continuous",
// Error code
"errorCode" : "<string>",
// Error message
"errorMessages" : ["<string>"]
},
// The following two items are returned when an error occurs during query analysis.
"errorType": "<string>",
"error": "<string>",
// Returns a warning message, which usually indicates an incomplete query.
"warnings": ["<string>"],
// Returns general information that usually does not require special handling.
"infos": ["<string>"]
}
Retry policy
The retry policy guides the client's retry behavior when an error occurs. This prevents ineffective retries that can degrade the user experience.
Retry policy | Description | Recommended action |
Once | The error may be caused by factors such as throttling. Retrying once might result in the same error, but there is a small chance of recovery. | Wait for a period of time (more than 300 ms is recommended) and then retry once. If the same return code (Once) is received again, stop retrying. |
Continuous | A server-side issue exists. Retry continuously at intervals. The service may recover. | Use an annealing strategy to retry. The recommended annealing method is: |
None | If the same request is sent again, this error is certain to occur again. | Do not retry. |
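The three policies in the table can be sketched as a small client-side retry driver. This is an illustration under stated assumptions, not an SLS client: `send`, `call_with_retry`, `retry_policy_of`, and `backoff_delays` are hypothetical names, "annealing" is read here as progressively longer waits between attempts, and the backoff parameters (0.5 s base, doubling, 30 s cap, 60 s total) are placeholders because this topic does not state the recommended values.

```python
import time
from typing import Callable, Iterator, Optional


def backoff_delays(base: float = 0.5, factor: float = 2.0, cap: float = 30.0) -> Iterator[float]:
    """Yield progressively longer waits (placeholder schedule, not from SLS)."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def retry_policy_of(body: dict) -> Optional[str]:
    """Return slsStatus.retryPolicy, or None when slsStatus is absent."""
    sls_status = body.get("slsStatus")
    return sls_status.get("retryPolicy") if sls_status else None


def call_with_retry(
    send: Callable[[], dict],
    max_continuous_seconds: float = 60.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call send() and retry according to the retryPolicy in the response body."""
    body = send()
    policy = retry_policy_of(body)
    if policy == "Once":
        sleep(0.35)  # the table recommends waiting more than 300 ms
        return send()  # stop after one retry, whatever the outcome
    if policy == "Continuous":
        deadline = clock() + max_continuous_seconds
        for delay in backoff_delays():
            if clock() + delay > deadline:
                break  # retry budget exhausted; return the last response
            sleep(delay)
            body = send()
            if retry_policy_of(body) != "Continuous":
                break
        return body
    # "None" policy or no slsStatus field: do not retry.
    return body
```

Passing `sleep` and `clock` as parameters keeps the function testable without real waiting; in production code the defaults can be left unchanged.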
Status codes and recommended actions
Request type | HTTP status code | ErrorCode | Description | Retry policy | Recommended action |
Query & Write | 200 | None | The request was successful. | None | None |
Query | 200 | ShardPartialSuccess | Execution failed on some shards. This is usually caused by a full queue or a breakdown. | Continuous | Retry using an annealing strategy. For more information, see the Continuous retry policy. |
Query | 200 | ShardResourceExceed | The requested data volume exceeds the shard read limit. For more information, see MetricStore. | Once | Retry once. The error is likely to occur again on the next request. Split the shard or use the parallel computing mode. |
Query | 200 | EngineResourceExceed | This error is usually returned because the compute engine detects that the data volume far exceeds the limit before execution. For more information, see MetricStore. | None | Narrow the time range or optimize the query to reduce the amount of data scanned. You can also use the parallel computing mode. |
Query | 200 | BadDataWarning | A calculation error caused by a data issue. This is common in scenarios where data does not match for multi-element operators in Prometheus Query Language (PromQL). In this case, only partial calculation results are retained. | None | Check if the data and query meet your business requirements. Optimize the data or query. |
Query & Write | 400 | BadParameterError | Query parameter or syntax error. | None | View the error message and correct the statement or parameter. |
Query & Write | 422 | BadDataError | A calculation or write error caused by a data issue. | None | Check if the data and query meet your business requirements. Optimize the data or query. |
Query | 422 | EngineExecutionExceed | The volume of data involved in the calculation in the compute engine exceeds the limit. For more information, see MetricStore. | None | Narrow the time range or optimize the query to reduce the amount of data scanned. You can also use the parallel computing mode. |
Query & Write | 401 | Unauthorized | Permission denied for the operation. | None | For more information, see Overview and grant permissions to the account. |
Query & Write | 404 | ProjectNotExist | The project does not exist. | None | If the project has not been deleted, check if the project name and endpoint are correct. |
Query & Write | 404 | MetricStoreNotExist | The MetricStore does not exist. | None | If the MetricStore has not been deleted, check if the MetricStore name is correct. |
Query | 503 | EngineQueueTimeout | The engine queue timed out. This is usually caused by high engine load. | Continuous | Retry using an annealing strategy. If the error persists, submit a ticket to contact technical support. |
Query | 500 | EngineExecutionError | An unknown error occurred during engine execution. This is usually caused by an invalid data format. | Once | Retry once. If the error persists, submit a ticket. |
Query | 502 | EngineExecutionTimeout | The engine timed out while executing the query. This is usually caused by high engine load or high query complexity. | Once | Retry once. If the error persists, submit a ticket. |
Write | 500 | WriteQuotaExceed | A write error caused by exceeding the project or shard quota. For more information, see Data read and write operations. | Continuous | Use an annealing strategy for write retries. Also, check the error message. If the project quota is exceeded, submit a ticket to contact technical support to adjust the quota. If the store quota is exceeded, enable automatic shard splitting for the store or split the shards manually. |
Query & Write | 500 | InternalServerError | Other internal system errors. | Continuous | Retry using an annealing strategy. If the error persists, submit a ticket. |
Response examples
Successful request
{
"status": "success",
"data": {
"resultType": "matrix",
"result": [
{
"metric": {},
"values": [
[
1673798460,
"11111111"
],
[
1673799060,
"22222222"
],
[
1673799660,
"33333333"
]
]
}
]
}
}
Partial data returned
When a query is incomplete, partial data is returned. In this case, the slsStatus field contains a detailed error message and a retry policy.
{
"status": "success",
"slsStatus": {
"retryPolicy": "Once",
"errorCode": "ShardResourceExceed",
"errorMessages": ["sls response error 1 task data", "sls response drop 1 task data"]
},
"data": {
"resultType": "vector",
"result": [{
"metric": {
"__name__": "up"
},
"value": [
1747807789.37,
"1"
]
}]
},
"warnings": ["sls response error 1 task data", "sls response drop 1 task data"]
}
Returned errors
Example 1
{
"status": "error",
"slsStatus": {
"retryPolicy": "None",
"errorCode": "EngineResourceExceed",
"errorMessages": ["too many time Series or items"]
},
"data": {
},
"error": "too many time Series or items"
}
Example 2
{
"status": "error",
"slsStatus": {
"retryPolicy": "None",
"errorCode": "BadDataError",
"errorMessages": ["vector cannot contain metrics with the same labelset"]
},
"data": {
},
"error": "vector cannot contain metrics with the same labelset"
}
Scenarios
Frontend display
MetricStore features built-in metric visualization, supports building Dashboards, and is compatible with Grafana. If a query is abnormal, the built-in visualization solution automatically displays the relevant information on the chart.
Use Grafana 10.0 or later for better warning messages.
If you build a custom visualization solution based on the MetricStore Prometheus API, handle exceptions based on the following policies:
If the HTTP status code is 200, the chart is rendered normally. If the slsStatus field exists, display a warning message on the chart and provide a solution based on the retryPolicy.
If the HTTP status code is not 200, display the error message directly on the chart and provide a solution based on the retryPolicy.
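As an illustration of these display rules, the following sketch maps a response to what a custom chart could show. The names `ChartNotice` and `classify_for_chart` are hypothetical, and the fallback text `HTTP <code>` for responses without any error message is an assumption of this example, not SLS behavior.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChartNotice:
    render: bool                 # whether to draw the chart from "data"
    level: Optional[str]         # None, "warning", or "error"
    message: str                 # text to show on the chart
    retry_policy: Optional[str]  # slsStatus.retryPolicy, if present


def classify_for_chart(http_status: int, body: dict) -> ChartNotice:
    """Decide how a chart should present a MetricStore Prometheus API response."""
    sls_status = body.get("slsStatus")
    policy = sls_status.get("retryPolicy") if sls_status else None
    if http_status == 200:
        if sls_status is None:
            return ChartNotice(True, None, "", None)
        # Data is usable but may be incomplete: render it and warn.
        text = "; ".join(sls_status.get("errorMessages", []))
        return ChartNotice(True, "warning", text, policy)
    # Non-200: show the error instead of a chart.
    text = (
        body.get("error")
        or "; ".join((sls_status or {}).get("errorMessages", []))
        or f"HTTP {http_status}"
    )
    return ChartNotice(False, "error", text, policy)
```

A frontend could then choose the hint shown next to the message from `retry_policy`, for example suggesting a narrower time range when the policy is None.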
Secondary data calculation scenario
MetricStore supports using Scheduled SQL to periodically calculate aggregate data. If you build custom secondary calculation scenarios based on the Prometheus API for MetricStore, we recommend that you handle abnormalities according to the following policies:
If the status code is 200:
  If the slsStatus field does not exist, store the data and proceed to the next scheduled job.
  If the slsStatus field exists, handle the exception based on the retryPolicy as follows:
    None: Retry once. If the policy is still None, mark the task as failed and proceed to the next scheduled job.
    Once: Retry once. If the policy is still Once, mark the task as failed and proceed to the next scheduled job.
    Continuous: Retry using an annealing strategy. If the retries continue for a certain period (10 to 30 minutes is recommended), mark the task as failed and proceed to the next scheduled job.
If the status code is not 200:
  If the slsStatus field does not exist, the error is usually caused by network issues. Continuous retries are recommended.
  If the slsStatus field exists, handle the exception based on the retryPolicy as follows:
    None: Retry once. If the policy is still None, mark the task as failed and proceed to the next scheduled job.
    Once: Retry once. If the policy is still Once, mark the task as failed and proceed to the next scheduled job.
    Continuous: Retry using an annealing strategy. If the retries continue for a certain period (10 to 30 minutes is recommended), mark the task as failed and proceed to the next scheduled job.
We recommend that you write data from failed tasks to a separate LogStore and configure alerts as needed. For more information, see Set alerts for scheduled SQL tasks.
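The decision rules for a scheduled job can be condensed into one pure function that a custom scheduler might call after each attempt. This is a sketch only: `decide_next_step` and its parameters are hypothetical names, the 600-second default is simply the lower end of the 10 to 30 minute recommendation, and bounding the network-error retries by the same budget is an assumption of this example.

```python
from typing import Optional


def decide_next_step(
    http_status: int,
    body: Optional[dict],
    retries_done: int,
    elapsed_seconds: float,
    continuous_budget_seconds: float = 600.0,
) -> str:
    """Return "store", "retry", or "fail" for one scheduled calculation attempt.

    retries_done counts retries already made for this job run;
    elapsed_seconds is the time spent on this run so far.
    """
    sls_status = (body or {}).get("slsStatus")
    within_budget = elapsed_seconds < continuous_budget_seconds
    if sls_status is None:
        if http_status == 200:
            return "store"  # complete result: persist it and move on
        # Non-200 without slsStatus is usually a network issue: keep retrying.
        # Capping these retries with the same budget is an assumption.
        return "retry" if within_budget else "fail"
    policy = sls_status.get("retryPolicy")
    if policy in ("None", "Once"):
        # Retry once; a second response with the same kind of policy fails the task.
        return "retry" if retries_done == 0 else "fail"
    if policy == "Continuous":
        return "retry" if within_budget else "fail"
    return "fail"  # unknown policy value: treat conservatively
```

When the function returns "fail", the caller would record the failed task, for example in the separate LogStore mentioned above, and then wait for the next scheduled run.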
Real-time alerting scenario
SLS lets you create alerts directly on MetricStore. However, if you build a custom real-time alerting system based on the Prometheus API for MetricStore, you can handle abnormalities using the following policies:
If the status code is 200:
  If the slsStatus field does not exist, monitor the alert conditions normally.
  If the slsStatus field exists, handle the exception based on the retryPolicy as follows:
    None: Mark the current alert execution as failed and proceed to the next scheduled execution.
    Once: Retry once. If the policy is still Once, mark the current alert execution as failed and proceed to the next scheduled execution.
    Continuous: Retry using an annealing strategy. If the retries continue for a certain period (within 1 minute is recommended), mark the current alert execution as failed and proceed to the next scheduled execution.
If the status code is not 200:
  If the slsStatus field does not exist, the error is usually caused by network issues. Continuous retries are recommended.
  If the slsStatus field exists, handle the exception based on the retryPolicy as follows:
    None: Mark the current alert execution as failed and proceed to the next scheduled execution.
    Once: Retry once. If the policy is still Once, mark the current alert execution as failed and proceed to the next scheduled execution.
    Continuous: Retry using an annealing strategy. If the retries continue for a certain period (within 1 minute is recommended), mark the current alert execution as failed and proceed to the next scheduled execution.
Recommendations:
The total retry time should not exceed the interval between scheduled alerts. For example, if the alert interval is 1 minute and the total retry time exceeds 1 minute, skip the current execution and proceed to the next one. Otherwise, you may miss alerts for new and more important data.
We recommend that you write data from failed alert executions to a separate LogStore and configure alerts as needed. For more information, see Custom analysis of alert logs.
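The first recommendation, keeping total retry time within the alert interval, can be illustrated by precomputing the waits that fit into one interval. This is a sketch under assumptions: `delays_within_interval` is a hypothetical helper, the doubling schedule stands in for the annealing strategy, and `reserve_seconds` is an illustrative allowance for the time the queries themselves take.

```python
from typing import List


def delays_within_interval(
    interval_seconds: float,
    base: float = 1.0,
    factor: float = 2.0,
    reserve_seconds: float = 0.0,
) -> List[float]:
    """Return growing retry waits whose sum fits inside one alert interval."""
    budget = interval_seconds - reserve_seconds
    delays: List[float] = []
    total = 0.0
    delay = base
    while total + delay <= budget:
        delays.append(delay)
        total += delay
        delay *= factor
    return delays


# With a 1-minute alert interval, five retries fit (31 seconds of waiting in total).
print(delays_within_interval(60))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

Once the list is exhausted, the current alert execution is marked as failed and the next scheduled execution proceeds as usual.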