本文通过示例为您介绍,如何快速将数据随机写入ClickHouse集群各个节点的本地表。
前提条件
已创建ClickHouse集群,详情请参见创建ClickHouse集群。
操作步骤
使用SSH方式登录ClickHouse集群,详情请参见登录集群。
执行以下命令,下载官方样例数据集。
curl https://datasets.clickhouse.com/hits/tsv/hits_v1.tsv.xz | unxz --threads=`nproc` > hits_v1.tsv
执行如下命令,启动ClickHouse客户端。
clickhouse-client -h core-1-1 -m
说明本示例登录core-1-1节点,如果您有多个Core节点,可以登录任意一个节点。
执行如下命令,创建数据库。
可以使用on CLUSTER参数在集群的所有节点创建数据库,默认集群标识为cluster_emr。
CREATE DATABASE IF NOT EXISTS demo on CLUSTER cluster_emr;
返回信息如下所示。
在集群上的所有节点创建一张复制表(Replicated表)。
复制表(Replicated表)会根据副本的个数,实现数据的多副本,并实现数据的最终一致性。
CREATE TABLE demo.hits_local ON CLUSTER cluster_emr ( `WatchID` UInt64, `JavaEnable` UInt8, `Title` String, `GoodEvent` Int16, `EventTime` DateTime, `EventDate` Date, `CounterID` UInt32, `ClientIP` UInt32, `ClientIP6` FixedString(16), `RegionID` UInt32, `UserID` UInt64, `CounterClass` Int8, `OS` UInt8, `UserAgent` UInt8, `URL` String, `Referer` String, `URLDomain` String, `RefererDomain` String, `Refresh` UInt8, `IsRobot` UInt8, `RefererCategories` Array(UInt16), `URLCategories` Array(UInt16), `URLRegions` Array(UInt32), `RefererRegions` Array(UInt32), `ResolutionWidth` UInt16, `ResolutionHeight` UInt16, `ResolutionDepth` UInt8, `FlashMajor` UInt8, `FlashMinor` UInt8, `FlashMinor2` String, `NetMajor` UInt8, `NetMinor` UInt8, `UserAgentMajor` UInt16, `UserAgentMinor` FixedString(2), `CookieEnable` UInt8, `JavascriptEnable` UInt8, `IsMobile` UInt8, `MobilePhone` UInt8, `MobilePhoneModel` String, `Params` String, `IPNetworkID` UInt32, `TraficSourceID` Int8, `SearchEngineID` UInt16, `SearchPhrase` String, `AdvEngineID` UInt8, `IsArtifical` UInt8, `WindowClientWidth` UInt16, `WindowClientHeight` UInt16, `ClientTimeZone` Int16, `ClientEventTime` DateTime, `SilverlightVersion1` UInt8, `SilverlightVersion2` UInt8, `SilverlightVersion3` UInt32, `SilverlightVersion4` UInt16, `PageCharset` String, `CodeVersion` UInt32, `IsLink` UInt8, `IsDownload` UInt8, `IsNotBounce` UInt8, `FUniqID` UInt64, `HID` UInt32, `IsOldCounter` UInt8, `IsEvent` UInt8, `IsParameter` UInt8, `DontCountHits` UInt8, `WithHash` UInt8, `HitColor` FixedString(1), `UTCEventTime` DateTime, `Age` UInt8, `Sex` UInt8, `Income` UInt8, `Interests` UInt16, `Robotness` UInt8, `GeneralInterests` Array(UInt16), `RemoteIP` UInt32, `RemoteIP6` FixedString(16), `WindowName` Int32, `OpenerName` Int32, `HistoryLength` Int16, `BrowserLanguage` FixedString(2), `BrowserCountry` FixedString(2), `SocialNetwork` String, `SocialAction` String, `HTTPError` UInt16, `SendTiming` Int32, `DNSTiming` Int32, `ConnectTiming` Int32, `ResponseStartTiming` Int32, `ResponseEndTiming` Int32, `FetchTiming` Int32, `RedirectTiming` Int32, `DOMInteractiveTiming` Int32, `DOMContentLoadedTiming` Int32, `DOMCompleteTiming` Int32, `LoadEventStartTiming` Int32, `LoadEventEndTiming` Int32, `NSToDOMContentLoadedTiming` Int32, `FirstPaintTiming` Int32, `RedirectCount` Int8, `SocialSourceNetworkID` UInt8, `SocialSourcePage` String, `ParamPrice` Int64, `ParamOrderID` String, `ParamCurrency` FixedString(3), `ParamCurrencyID` UInt16, `GoalsReached` Array(UInt32), `OpenstatServiceName` String, `OpenstatCampaignID` String, `OpenstatAdID` String, `OpenstatSourceID` String, `UTMSource` String, `UTMMedium` String, `UTMCampaign` String, `UTMContent` String, `UTMTerm` String, `FromTag` String, `HasGCLID` UInt8, `RefererHash` UInt64, `URLHash` UInt64, `CLID` UInt32, `YCLID` UInt64, `ShareService` String, `ShareURL` String, `ShareTitle` String, `ParsedParams` Nested(Key1 String,Key2 String,Key3 String,Key4 String,Key5 String,ValueDouble Float64), `IslandID` FixedString(16), `RequestNum` UInt32, `RequestTry` UInt8 ) ENGINE = ReplicatedMergeTree('/clickhouse/tables/{shard}/{database}/hits_local', '{replica}') PARTITION BY toYYYYMM(EventDate) ORDER BY (CounterID, EventDate, intHash32(UserID)) SAMPLE BY intHash32(UserID);
说明{shard}和{replica}是阿里云EMR为ClickHouse集群自动生成的宏定义,可以直接使用。
执行以下命令,创建分布式(Distributed)表。
分布式表不存储数据,仅仅是底层表的一个View,但可以在多个服务器上进行分布式查询。本例中使用随机函数rand(),表示数据会随机写入各个节点的本地表。
CREATE TABLE demo.hits_all on CLUSTER cluster_emr AS demo.hits_local ENGINE = Distributed(cluster_emr, demo, hits_local, rand());
退出ClickHouse客户端,在样例数据的目录下执行以下命令,导入数据。
clickhouse-client -h core-1-1 --query "INSERT INTO demo.hits_all FORMAT TSV" --max_insert_block_size=100000 < hits_v1.tsv;
重新启动ClickHouse客户端,查看数据。
因为数据是随机写入的,各节点数据量可能不同。
查看core-1-1节点demo.hits_all的数据量。
select count(*) from demo.hits_all;
查看core-1-1节点demo.hits_local的数据量。
select count(*) from demo.hits_local;
查看core-1-2节点demo.hits_local的数据量。
说明其余节点,您也可以按照以下步骤来查看demo.hits_local的数据量。节点名称您可以在EMR控制台的节点管理页面查看。
执行以下命令,登录ClickHouse客户端。
clickhouse-client -h core-1-2 -m
在ClickHouse客户端,执行以下命令,查看demo.hits_local的数据量。
select count(*) from demo.hits_local;