AliES kernel versions and optimized features

AliES is a highly tailored kernel for Alibaba Cloud Elasticsearch. It supports all open-source Elasticsearch kernel features and adds capabilities developed by the Alibaba Cloud Elasticsearch team—including metric optimization, thread pooling, circuit breaking optimization, and query and write performance optimization. These additions improve cluster stability and performance, reduce costs, and extend monitoring and O&M coverage. This topic describes new and optimized features in each AliES version.

Elasticsearch V7.16.2

Kernel version 1.7.0

Plug-ins

The aliyun-timestream plug-in is available. Use it to create, modify, query, and delete time series indexes, simplifying time series data management. For more information, see Overview of aliyun-timestream.
Prometheus Query Language (PromQL) statements are supported for querying data stored in Elasticsearch. For more information, see Integrate Elasticsearch with Prometheus and Grafana based on aliyun-timestream to implement integrated monitoring.

Elasticsearch V7.10.0

Kernel version 1.12.0

Search

The analysis-dynamic-synonym plug-in is available.
Primary shard balancing is supported.
Parameter value lengths in wildcard and prefix queries are now limited.
Complex queries—including terms and prefix queries on keyword fields—are optimized using doc_values. Query performance improves by up to 80% in low-hit-ratio scenarios.
Numeric term and terms queries are optimized using doc_values. Query performance improves by up to 80% in low-hit-ratio scenarios.
BKD-tree term and terms query performance is optimized by 30% using a lazy loading strategy.
OpenStore: File deletion operations are executed asynchronously.

Bug fixes

Task management at the storage layer is improved to resolve an issue where RPC-based communication occasionally stalled.
The data replication process is improved to prevent the "fail engine" error on replica nodes.
The replica shard promotion process is improved to prevent index inconsistency between primary and replica shards.

Kernel version 1.10.0

Store/Snapshot

OpenStore:
- The OpenStore storage engine supports smooth upgrades from hot and cold data separation versions to intelligent hybrid storage versions. The upgrade process uses a blue-green deployment and does not affect online services.
- Optimized task scheduling for the OpenStore storage engine. Real-time network traffic control for task concurrency and congestion avoidance significantly improves the efficiency of physical replication and data tiering.
- OpenStore storage performance improvements:
  - Optimized the write path for hybrid storage. The write performance is now on par with native ES indexes.
  - Optimized metadata management to reduce disk I/O utilization spikes and improve query stability.
LuceneVerifyIndexOutput is optimized to improve index restoration speed. For details, see ES pull #96975.

Cluster coordination

ClusterState is no longer referenced by persistent tasks. In large-scale clusters, dedicated master nodes can accumulate high memory usage. To prevent leader election timeouts in those environments, the default value of cluster.election.initial_timeout is changed from 100 milliseconds to 1 second. For details, see ES pull #90724.

Search

Indexing Service stability enhancements:
- Added an unfollow switch. This switch lets you temporarily stop the unfollow process during an exception to speed up troubleshooting.
- Optimized the submission of unfollow tasks to the follower master. When an index in the follower cluster is unfollowed, operations such as reopen no longer put pressure on the master.
- Optimized the write retry logic. This resolves stability issues in the Indexing Service caused by write retries during index creation.
End-to-end query timeout is added to control overall query duration. When a timeout occurs, partial results are returned instead of failing the request.
Additional fields are added to access logs.

Bug fixes

OpenStore:
- Fixed a shard allocation failure that was caused by concurrent refresh operations during a primary/standby switchover.
- Fixed an issue where improper use of read/write locks during metadata access caused a `ThreadLocalMap` buildup in query threads.
Indexing Service:
- Fixed a query jitter issue caused by the improper use of the network thread pool during cross-cluster physical replication.
Fixed an issue where the DV update index file referenced by Lucene Merge was deleted by concurrent flush operations. For details, see Lucene pull #13017.

Kernel version 1.9.0

Search

The concurrent query framework is reconstructed for Kernel-enhanced Edition clusters, with the following improvements:

JVM heap memory is reused, reducing garbage collection (GC) overhead and improving resource utilization.
Fetch phase duration for raw text retrieval is reduced. With size set to 10,000, the fetch phase is 6 to 10 times faster and the overall query duration is reduced by 50%.
The following aggregation types are now supported in concurrent queries: percentile, percentile ranks, sampler, diversified sampler, significant text, geo_distance, geohash_grid, geotile_grid, geo_bounds, geo_centroid, and scripted_metric aggregations.
Fields including traceId and a query duration field are added to end-to-end access logs. Use traceId to trace complete query execution across nodes.
Custom index structure and mapping parsing for raw text are optimized, doubling write performance for raw text.
OpenStore:
- For OpenStore intelligent hybrid storage instances, the hybrid storage feature is enabled by default for service indexes to improve usability.
- Improved the reliability of the metastore for OpenStore intelligent hybrid storage instances. For example, metadata is not lost if a local disk fails.
- Eliminated disk I/O utilization spikes caused by writing large files on OpenStore intelligent hybrid storage instances.
- Fixed an issue where improper use of `termVectorsReader` at the Lucene layer caused a `ThreadLocalMap` buildup in ES management threads.
- Query performance improvements for OpenStore storage instances:
  - Enabled a pre-read policy by default for inverted index reads. This improves inverted index query performance by 30%.
  - Sped up queries against the local page cache after a process restart.

Caching

For scenarios with few primary queries but a large number of subqueries, caching was not applied to subqueries. To enable caching in these scenarios, run the following API call:

```
PUT _cluster/settings
{
  "persistent": {
    "search.query_cache_get_wait_lock_enable": "true",
    "search.query_cache_skip_factor": "200000000"
  }
}
```
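
To confirm that the two settings took effect, or to roll them back later, the standard cluster settings API can be used. The following is a minimal sketch. It assumes that these AliES-specific settings behave like other persistent cluster settings in open-source Elasticsearch, where assigning null resets a setting to its default. This topic does not state that behavior for AliES.

```
# List the cluster settings in flat form and look for the two search.query_cache_* keys.
GET _cluster/settings?flat_settings=true

# Assumption: null removes the persistent override and restores the default.
PUT _cluster/settings
{
  "persistent": {
    "search.query_cache_get_wait_lock_enable": null,
    "search.query_cache_skip_factor": null
  }
}
```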

k-NN

Data inconsistency between primary and replica shards in k-NN query scenarios is resolved.

Bug fixes

Fixed an issue where running GET _cat/nodes failed after a shard on a node was migrated during a blue-green update.
Resolved an issue where transaction log loss during a power-off restart of an OpenStore intelligent hybrid storage instance caused shard allocation failures.
Resolved an issue where frequent primary shard switchovers on an OpenStore intelligent hybrid storage instance caused index files to be mistakenly deleted, leading to shard allocation failures.

Kernel version 1.8.0

Plug-ins

The aliyun-timestream plug-in is available for Elasticsearch V7.10.0. It enhances storage and query performance for time series data and supports:

Creating, modifying, querying, and deleting time series indexes
Executing PromQL statements to query data stored in Elasticsearch
Writing data to time series indexes using the InfluxDB line protocol

For more information, see Overview of aliyun-timestream, Integrate Elasticsearch with Prometheus and Grafana based on aliyun-timestream to implement integrated monitoring, and Integrate aliyun-timestream with the InfluxDB line protocol.

Kernel version 1.7.1

New features

OpenStore intelligent hybrid storage is released. It goes beyond the traditional hot and cold data separation architecture to significantly reduce the complexity of data ingestion and lower storage costs. You do not need to plan and purchase hot data storage space in advance because storage is billed based on actual usage. You also do not need to manually configure hot and cold lifecycle policies because the system automatically performs data tiering.
After you enable the Indexing Service, you can set an unhosting duration to improve the stability of cloud-hosted writes.

Kernel version 1.7.0

Search

The analytic-search plug-in is available. It significantly improves query performance in log scenarios:

If a user cluster has a high write volume, you can smoothly migrate it to a resource space with more resources to relieve pressure on the Indexing Service cluster. This has no impact on the user cluster.
Index merging policies and date histogram aggregation policies are optimized. Unconditional or single-condition queries—such as those on the Kibana Discover page—are more than 6x faster in log query scenarios. In environments ingesting more than 1 TB of data per day, query time drops from minutes to 5 seconds or less.
Concurrent data recall is supported for concurrent queries, improving resource utilization and reducing average data recall time by 50% in log scenarios.
Read-only small segments are continuously merged before force merge, improving query performance by 20%.

Performance improvements

Write requests between client nodes and data nodes are compressed using LZ4. This reduces network bandwidth overhead by 30%.
Force merge can run in parallel across shards, reducing the total force merge duration.
Large data blocks in raw text can be compressed, and zstd compression parameters are optimized, reducing raw text size by 8%. The Patched Frame of Reference (PFOR) method is also supported for Lucene postings, reducing index size by an additional 3%.

Bug fixes

Fixed an issue where the source_reuse_doc_values feature of the aliyun-codec plug-in did not support fields whose names contained periods (.).
Fixed an issue where a Performance-enhanced Edition cluster with the Indexing Service enabled would get stuck in the `relocating` state for a long time when `force_merge` was executed to unhost a shard.

Kernel version 1.6.0

Compression

The OpenStore hot and cold shared computing mode is supported. In this mode, data nodes are shared for computing on hot and cold data, and query performance is optimized. You can store massive amounts of log data without purchasing separate cold data nodes. For more information, see OpenStore intelligent hybrid storage (for log analysis) engine.
The cold-search plug-in is available. It supports shared querying of hot and cold data for OpenStore and provides CPU and memory isolation and circuit breaking for hot and cold indexes. For more information, see Use the hot and cold isolation plugin (cold-search).
OpenStore storage is supported to provide low-cost storage. For more information, see OpenStore intelligent hybrid storage (for log analysis) engine.
The source_reuse_doc_values feature is added to the aliyun-codec plug-in to further reduce index sizes and storage costs. For more information, see Use the aliyun-codec plug-in.
The Indexing Service feature is optimized, and a bug where the underlying index was not deleted after the frontend index was deleted is fixed. For more information, see Introduction to the Indexing Service series.

Throttling

The aliyun-qos plug-in is updated to V2.0, adding finer-grained throttling types and parameters. For more information, see Use the aliyun-qos plug-in.

Kernel version 1.5.0

Compression

The aliyun-codec plug-in is available to enhance kernel-level compression for clusters. For more information, see Use the aliyun-codec plug-in.

Bug fixes

Fixed a bug related to the search_as_you_type field type. For details, see GitHub issue #65319.

Kernel version 1.4.0

Search

The aliyun-knn plug-in is updated with improved write performance, script query support, and vector search accelerated by hardware-level optimizations.

Throttling

The aliyun-qos plug-in is optimized for cluster-level throttling. Traffic is automatically distributed across nodes without requiring knowledge of cluster topology or node load, improving cluster usability and stability.
The Indexing Service series is supported to improve the stability of tenant clusters.

Kernel version 1.3.0

Search

Slow query isolation is available to limit the impact of anomalous queries on cluster stability.
The gig plug-in is available. It performs a switchover within seconds when an exception occurs on a cluster node, preventing query jitter caused by anomalous nodes.
For Elasticsearch V7.10.0 Standard Edition clusters, the gig plug-in is integrated into the aliyun-qos plug-in, which is installed by default.

Replication

Physical replication is available to improve write performance for indexes with replica shards.

Time series

The pruning feature is available for time series indexes to improve query performance.

Observability

Cluster access logs can be viewed. Logs include fields such as Time, Node IP, and Content. Use these logs to troubleshoot issues and analyze requests.

Cluster management

Dedicated master node scheduling performance is improved by 10x, allowing each dedicated master node to schedule more shards.

Elasticsearch V6.7.0

Kernel version 1.3.0

Search

Slow query isolation is available to limit the impact of anomalous queries on cluster stability.
The gig plug-in is available. It performs a switchover within seconds when an exception occurs on a cluster node, preventing query jitter caused by anomalous nodes.

Important

Before using these features, confirm that your cluster runs kernel version V1.3.0. If needed, upgrade the kernel. Kernel upgrades are supported only for Standard Edition clusters running kernel V0.3.0, V1.0.2, or V1.3.0.

Kernel version 1.2.0

Replication

Physical replication is available to improve write performance for indexes with replica shards.

Time series

The pruning feature is available for time series indexes to improve query performance.

Write performance

Primary key-based data deduplication during queries is optimized, improving write performance for documents with primary keys by 10%.

Storage

Finite state transducers (FSTs) that do not occupy JVM heap memory are supported. A single node can store up to 20 TiB of index data.

Kernel version 1.0.2

Observability

Cluster access logs can be viewed. Logs include fields such as Time, Node IP, and Content. Use these logs to troubleshoot issues and analyze requests.

Kernel version 1.0.1

Circuit breaking

Circuit breaking policies for JVMs are configurable. When JVM heap memory usage reaches 95%, the cluster rejects incoming requests to protect stability. Configure the following parameters:

Parameter	Default
`indices.breaker.total.use_real_memory`	`false`
`indices.breaker.total.limit`	`95%`
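
In open-source Elasticsearch, indices.breaker.total.limit is a dynamic setting, whereas indices.breaker.total.use_real_memory is static and is set in the node's YML configuration. Assuming AliES follows the same split (this topic does not say), the limit could be adjusted and the breaker state inspected as sketched below. The 90% value is only illustrative.

```
# Assumption: the limit is dynamically updatable, as in open-source Elasticsearch.
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": "90%"
  }
}

# Show the configured limit, estimated usage, and trip count of each breaker per node.
GET _nodes/stats/breaker
```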

Kernel version 0.3.0

Cluster management

Dedicated master node scheduling performance is improved by 10x, allowing each dedicated master node to schedule more shards.

Write performance

Write performance is improved by 10% and translog flush overhead is reduced.