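Lindorm (HBase) data ingestion and ETL: use cases and best practices - Cloud Native Data Lake Analytics (documentation no longer maintained) - Alibaba Cloud Help Center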

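Vector data ingestion

To import vector data into Lindorm (HBase), see Quick Start.

Raster data ingestion

Pipeline technology

The Pipeline model is an ETL technology that DLA Ganos built on the open source GeoTrellis project for fast loading, processing, and ingestion of raster data.

The Pipeline model consists of a series of functional modules, such as reading data (Load), transforming it (Transform), and saving it (Save). A DLA Ganos Pipeline model is generally expressed as a JSON object. Its main object, called the pipeline, is an array of the steps to execute, and each step is itself a JSON object, referred to as a Stage Object. The entire ingestion flow and all of its parameters are defined in this single JSON description. A simple JSON script looks like this: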

[
  {
    "uri" : "OSS resource URI",
    "type" : "singleband.spatial.read.oss"
  },
  {
    "resample_method" : "nearest-neighbor",
    "type" : "singleband.spatial.transform.tile-to-layout"
  },
  {
    "crs" : "EPSG:3857",
    "scheme" : {
      "crs" : "epsg:3857",
      "tileSize" : 256,
      "resolutionThreshold" : 0.1
    },
    "resample_method" : "nearest-neighbor",
    "type" : "singleband.spatial.transform.buffered-reproject"
  },
  {
    "end_zoom" : 0,
    "resample_method" : "nearest-neighbor",
    "type" : "singleband.spatial.transform.pyramid"
  },
  {
    "name" : "mask",
    "uri" : "oss://geotrellis-test/colingw/pipeline/",
    "key_index_method" : {
      "type" : "zorder"
    },
    "scheme" : {
      "crs" : "epsg:3857",
      "tileSize" : 256,
      "resolutionThreshold" : 0.1
    },
    "type" : "singleband.spatial.write"
  }
]

Ingestion workflow

Import the required dependencies.

import geotrellis.layer._
import geotrellis.spark.pipeline._
import geotrellis.spark.pipeline.json._
import geotrellis.spark._
import geotrellis.spark.store.kryo.KryoRegistrator
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.{Failure, Try}

Initialize the Spark environment.

  val conf =
    new SparkConf()
      .setMaster("local[*]")
      .setAppName("Spark Tiler")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrator", classOf[KryoRegistrator].getName)
  conf.set("spark.kryoserializer.buffer.max", "2047m")
  implicit val sc: SparkContext = new SparkContext(conf)

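Define the Pipeline JSON description.

The following example is a simple pipeline model that defines these operations:

The URI of the file to import and the driver used to load it.
The tiling mode (tile-to-layout).
Data transformation, reprojection, and related operations.
The destination the data is written to (Lindorm).

The detailed configuration of the model is as follows: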

 val pipeline: String =
    """
      |[
      |  {
      |    "uri" : "OSS resource URI",
      |    "time_tag" : "TIFFTAG_DATETIME",
      |    "time_format" : "yyyy:MM:dd HH:mm:ss",
      |    "type" : "singleband.spatial.read.hadoop"
      |  },
      |  {
      |    "resample_method" : "nearest-neighbor",
      |    "type" : "singleband.spatial.transform.tile-to-layout"
      |  },
      |  {
      |    "crs" : "EPSG:3857",
      |    "scheme" : {
      |      "crs" : "EPSG:3857",
      |      "tileSize" : 256,
      |      "resolutionThreshold":0.1
      |    },
      |    "resample_method" : "nearest-neighbor",
      |    "type" : "singleband.spatial.transform.buffered-reproject"
      |  },
      |  {
      |    "end_zoom" : 0,
      |    "resample_method" : "nearest-neighbor",
      |    "type" : "singleband.spatial.transform.pyramid"
      |  },
      |  {
      |    "name" : "srtm",
      |    "uri" : "hbase://localhost:2181?master=localhost&attributes=attributes&layers=srtm-tms-layers",
      |    "pyramid" : true,
      |    "key_index_method" : {
      |      "type" : "zorder"
      |    },
      |    "scheme" : {
      |      "tileCols" : 256,
      |      "tileRows" : 256
      |    },
      |    "type" : "singleband.spatial.write"
      |  }
      |]
    """.stripMargin

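Run the Pipeline model.

// First, parse the JSON description of the Pipeline model into a list of expressions:
val list: List[PipelineExpr] = pipeline.pipelineExpr match {
  case Right(r) => r
  case Left(e) => throw e
}

// Then evaluate the pipeline model:
val erasedNode = list.erasedNode
Try {
  erasedNode.eval[Stream[(Int, TileLayerRDD[SpatialKey])]]
} match {
  // Report the failure and rethrow so that the job does not fail silently.
  case Failure(e) => println(s"Pipeline run failed: ${e.getMessage}"); throw e
  case _ =>
}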

Configuration file reference

Data loading objects

{
   "uri" : "{oss| file | hdfs | ...}://...",
   "time_tag" : "TIFFTAG_DATETIME", // optional field
   "time_format" : "yyyy:MM:dd HH:mm:ss", // optional field
   "type" : "{singleband | multiband}.{spatial | temporal}.read.{oss | hadoop}"
}

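The parameters are described below:

Key	Value
uri	URI of the raster data source
time_tag	Name of the time tag in the dataset metadata (optional)
time_format	Format of the time value stored under time_tag (optional)
type	Operation type

Note: Only two types of readers are available: one reads from OSS, and the other reads through the Hadoop API from any Hadoop-supported file system.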
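The `type` value follows a dotted pattern: band type, key type, operation, and an optional final segment (the reader backend or the transform name). As a minimal sketch of that structure, the snippet below splits a type string into its parts. `StageType` and `Parsed` are hypothetical names introduced only for illustration; they are not part of DLA Ganos or GeoTrellis, and the real parser may accept or reject different values.

```scala
// Hypothetical helper for illustration only; not part of DLA Ganos or GeoTrellis.
object StageType {
  final case class Parsed(band: String, key: String, operation: String, detail: Option[String])

  private val bands = Set("singleband", "multiband")
  private val keys = Set("spatial", "temporal")
  private val operations = Set("read", "transform", "write")

  /** Splits a stage type such as "singleband.spatial.read.oss" into its parts. */
  def parse(stageType: String): Either[String, Parsed] =
    stageType.split('.').toList match {
      case band :: key :: op :: rest
          if bands(band) && keys(key) && operations(op) && rest.size <= 1 =>
        // `rest` holds the optional last segment, for example "oss" or "pyramid".
        Right(Parsed(band, key, op, rest.headOption))
      case _ =>
        Left(s"Unrecognized stage type: $stageType")
    }
}
```

Checking the type string before submitting a job can surface a typo such as a missing key type early, before any Spark resources are allocated.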

Data writing objects

{
   "name" : "layerName",
   "uri" : "{oss| file | hdfs | ...}://...",
   "key_index_method" : {
      "type" : "{zorder | hilbert}",
      "temporal_resolution": 1 // optional, if set - temporal index is used
   },
   "scheme" : {
      "crs" : "epsg:3857",
      "tileSize" : 256,
      "resolutionThreshold" : 0.1
   },
   "type" : "{singleband | multiband}.{spatial | temporal}.write"
}

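The parameters are described below:

Key	Value
uri	URI of the target data store for the raster layer
name	Layer name
key_index_method	Key index method used to generate an index from the spatial key (Spatial Key)
key_index_method.type	Space-filling curve type: zorder, row-major, hilbert
key_index_method.temporal_resolution	Temporal resolution, in milliseconds (ms)
scheme	Target layout scheme
scheme.crs	CRS of the target scheme
scheme.tileSize	Tile size of the layout scheme
scheme.resolutionThreshold	Resolution for the user-defined layout scheme (optional field)

Note: Only two types of readers are available: one reads from OSS, and the other reads through the Hadoop API from any Hadoop-supported file system.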
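To make the `zorder` option more concrete, the following is a minimal sketch of how a Z-order (Morton) index can be derived from a tile's column and row by interleaving their bits. `ZOrderSketch` is a hypothetical name, and the bit layout (column in the even bit positions, row in the odd ones) is an assumption made for illustration; the index that the library actually writes may be laid out differently.

```scala
// Illustrative sketch only; the bit layout is an assumption, not the library's confirmed format.
object ZOrderSketch {
  /** Spreads the lower 31 bits of v so that a zero bit separates each original bit. */
  private def spread(v: Int): Long = {
    var result = 0L
    var i = 0
    while (i < 31) {
      if (((v >> i) & 1) == 1) result |= 1L << (2 * i)
      i += 1
    }
    result
  }

  /** Interleaves the bits of col and row: col fills even positions, row fills odd positions. */
  def index(col: Int, row: Int): Long = {
    require(col >= 0 && row >= 0, "tile coordinates must be non-negative")
    spread(col) | (spread(row) << 1)
  }
}
```

Because neighboring tiles share most of their high-order bits, they tend to receive nearby index values, which is why a space-filling curve suits row-key stores such as HBase.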

Data transformation objects

Tile To Layout
```
{
   "resample_method" : "nearest-neighbor",
   "type" : "{singleband | multiband}.{spatial | temporal}.transform.tile-to-layout"
}
```
Note: Transforms RDD[({ProjectedExtent | TemporalProjectedExtent}, {Tile | MultibandTile})] into RDD[({SpatialKey | SpaceTimeKey}, {Tile | MultibandTile})].
The parameters are described below:

Key	Options
resample_method	Resampling method: nearest-neighbor, bilinear, cubic-convolution, cubic-spline, lanczos

ReTile To Layout

{
   "layout_definition": {
      "extent": [0, 0, 1, 1],
      "tileLayout": {
         "layoutCols": 1,
         "layoutRows": 1,
         "tileCols": 1,
         "tileRows": 1
      }
    },
   "resample_method" : "nearest-neighbor",
   "type" : "{singleband | multiband}.{spatial | temporal}.transform.retile-to-layout"
}

Note: Retiles RDD[({SpatialKey | SpaceTimeKey}, {Tile | MultibandTile})] according to the user-configured layout definition.
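A layout definition pairs an extent with a grid of tiles, so every map coordinate falls into exactly one tile. The sketch below shows one way to compute that tile's column and row. `LayoutSketch`, `Extent`, `TileLayout`, and `keyFor` are simplified stand-ins defined here for illustration, not the library's own types, and the convention that rows are counted from the top (ymax) edge is an assumption.

```scala
// Simplified stand-in types for illustration; not the GeoTrellis or DLA Ganos classes.
object LayoutSketch {
  final case class Extent(xmin: Double, ymin: Double, xmax: Double, ymax: Double)
  final case class TileLayout(layoutCols: Int, layoutRows: Int, tileCols: Int, tileRows: Int)

  /** Returns the (col, row) of the layout tile that contains the map point (x, y).
    * Rows are counted downward from the ymax edge; no clamping is applied at the borders.
    */
  def keyFor(extent: Extent, layout: TileLayout, x: Double, y: Double): (Int, Int) = {
    val tileWidth = (extent.xmax - extent.xmin) / layout.layoutCols
    val tileHeight = (extent.ymax - extent.ymin) / layout.layoutRows
    val col = math.floor((x - extent.xmin) / tileWidth).toInt
    val row = math.floor((extent.ymax - y) / tileHeight).toInt
    (col, row)
  }
}
```

For example, with an extent of [0, 0, 4, 4] split into a 2 x 2 grid, the point (3.0, 1.0) lies in the tile at column 1, row 1.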

Buffered Reproject

{
   "crs" : "EPSG:3857",
   "scheme" : {
      "crs" : "epsg:3857",
      "tileSize" : 256,
      "resolutionThreshold" : 0.1
   },
   "resample_method" : "nearest-neighbor",
   "type" : "{singleband | multiband}.{spatial | temporal}.transform.buffered-reproject"
}

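Note: Reprojects RDD[({SpatialKey | SpaceTimeKey}, {Tile | MultibandTile})] into tiles in the target CRS according to the user-configured layout scheme.

The parameters are described below:

Key	Options
crs	CRS of the target scheme
tileSize	Tile size of the layout scheme
resolutionThreshold	Resolution for the user-defined layout scheme (optional field)
resample_method	Resampling method: nearest-neighbor, bilinear, cubic-convolution, cubic-spline, lanczos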

Per Tile Reproject

{
   "crs" : "EPSG:3857",
   "scheme" : {
      "crs" : "epsg:3857",
      "tileSize" : 256,
      "resolutionThreshold" : 0.1
   },
   "resample_method" : "nearest-neighbor",
   "type" : "{singleband | multiband}.{spatial | temporal}.transform.per-tile-reproject"
}

Note: Reprojects RDD[({ProjectedExtent | TemporalProjectedExtent}, {Tile | MultibandTile})] into tiles in the target CRS according to the user-configured layout scheme.

The parameters are described below:

Key	Options
scheme	Target layout scheme
scheme.crs	CRS of the target scheme
scheme.tileSize	Tile size of the layout scheme
scheme.resolutionThreshold	Resolution for the user-defined layout scheme (optional field)
resample_method	Resampling method: nearest-neighbor, bilinear, cubic-convolution, cubic-spline, lanczos

Pyramid
```
{
   "end_zoom" : 0,
   "resample_method" : "nearest-neighbor",
   "type" : "{singleband | multiband}.{spatial | temporal}.transform.pyramid"
}
```
Note: Builds a pyramid from RDD[({SpatialKey | SpaceTimeKey}, {Tile | MultibandTile})] down to the level defined by end_zoom. The return type is Stream[RDD[({SpatialKey | SpaceTimeKey}, {Tile | MultibandTile})]].
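As a rough illustration of what building a pyramid down to `end_zoom` implies, the sketch below lists the tile grid at each level, assuming the grid is halved in both directions (rounding up) for every step down in zoom. `PyramidSketch` and `levels` are hypothetical names, and the halving rule is a simplifying assumption; the actual level layouts are determined by the configured layout scheme.

```scala
// Illustrative sketch; assumes each lower zoom level halves the tile grid, rounding up.
object PyramidSketch {
  /** Lists (zoom, layoutCols, layoutRows) from startZoom down to endZoom inclusive. */
  def levels(startZoom: Int, endZoom: Int, cols: Int, rows: Int): List[(Int, Int, Int)] = {
    require(startZoom >= endZoom, "startZoom must not be below endZoom")
    require(cols > 0 && rows > 0, "the tile grid must not be empty")
    Iterator
      .iterate((startZoom, cols, rows)) { case (z, c, r) => (z - 1, (c + 1) / 2, (r + 1) / 2) }
      .takeWhile { case (z, _, _) => z >= endZoom }
      .toList
  }
}
```

Calling `PyramidSketch.levels(2, 0, 4, 4)` yields three levels, ending with a single tile at zoom 0, which mirrors an `end_zoom` of 0 in the stage configuration.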

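About Layout Scheme

DLA Ganos supports two Layout Scheme modes:

ZoomedLayoutScheme
Matches a TMS pyramid.
Important: ZoomedLayoutScheme needs to know the world extent, which it obtains from the CRS, in order to build the TMS pyramid layout. This may cause the input raster to be resampled to match the resolution of a TMS level.
FloatingLayoutScheme
Matches the native resolution of the input raster.
Important: FloatingLayoutScheme discovers the native resolution and extent and partitions the raster by the given tile size, without resampling.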
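The difference between the two schemes can be sketched with a little arithmetic. The snippet below assumes EPSG:3857 and 256-pixel tiles: a zoomed layout fixes the cell size of each level from the world width, so a raster usually has to be resampled to the chosen level, whereas a floating layout only counts how many tiles cover the raster at its native resolution. `LayoutSchemeSketch` and its functions are hypothetical, and the zoom selection rule shown here is a simplification that ignores `resolutionThreshold`; it is not the library's algorithm.

```scala
// Illustrative arithmetic only; the zoom selection rule is a simplifying assumption.
object LayoutSchemeSketch {
  // Width of the EPSG:3857 world extent in meters (2 * pi * 6378137).
  val WorldWidth: Double = 2 * math.Pi * 6378137.0

  /** Cell size (meters per pixel) of a TMS level built from tileSize x tileSize tiles. */
  def zoomedCellSize(zoom: Int, tileSize: Int = 256): Double =
    WorldWidth / (tileSize.toDouble * (1L << zoom))

  /** Simplified rule: the first zoom whose cell size is at least as fine as the raster's. */
  def zoomFor(rasterCellSize: Double, tileSize: Int = 256, maxZoom: Int = 30): Int =
    (0 to maxZoom).find(z => zoomedCellSize(z, tileSize) <= rasterCellSize).getOrElse(maxZoom)

  /** Floating layout: (layoutCols, layoutRows) needed to cover a raster at native resolution. */
  def floatingLayout(rasterCols: Int, rasterRows: Int, tileSize: Int = 256): (Int, Int) =
    ((rasterCols + tileSize - 1) / tileSize, (rasterRows + tileSize - 1) / tileSize)
}
```

Under these assumptions, a raster with 30 m cells would be assigned to zoom 13 (about 19.1 m per pixel) and resampled, while a floating layout would keep the 30 m cells and split a 1000 x 600 pixel raster into a 4 x 3 grid of tiles.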
