Preparations
This tool uses a file streaming interface to import data into a Cassandra cluster quickly. It is one of the fastest ways to migrate offline data to an online Cassandra cluster. The following preparations are required:
An online Cassandra cluster.
Offline data in SSTable or CSV format.
A separate ECS instance in the same VPC. Configure the security group to allow access to the Cassandra cluster.
1. Prepare a client ECS instance in the same VPC
Use a separate ECS instance instead of an instance from the online Cassandra cluster. This prevents the import process from affecting the online service.
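To check that the security group actually lets the client ECS instance reach the cluster, a plain TCP connection test is often enough. The sketch below is only an illustration and is not part of the tool: the class name `PortCheck` is hypothetical, and ports 9042 (CQL native transport) and 7000 (inter-node storage, used for streaming) are Cassandra's stock defaults, which your cluster may override.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: checks whether a TCP port on a Cassandra node is reachable.
final class PortCheck {
    private PortCheck() {}

    /** Returns true if a TCP connection to host:port succeeds within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Refused, timed out, or unresolvable host: all count as "not reachable".
            return false;
        }
    }

    public static void main(String[] args) {
        // Pass the node address as the first argument; 127.0.0.1 is only a placeholder.
        String host = args.length > 0 ? args[0] : "127.0.0.1";
        // Assumed default ports: 9042 = native transport, 7000 = storage (streaming).
        int[] ports = {9042, 7000};
        for (int port : ports) {
            System.out.println(host + ":" + port + " reachable=" + canConnect(host, port, 3000));
        }
    }
}
```

If either port is reported as unreachable from the client instance, review the security group rules before starting the import.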
2. Create a schema
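Create the target keyspace and table in the cluster before you import data. For example, if the CQL statements are saved in `schema.cql`:
$ cqlsh -f schema.cql -u USERNAME -p PASSWORD [host]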
3. Prepare the data
3.1 SSTable data format
Create a directory tree of the form `data/${keyspace}/${table}` and place the SSTable files in the table directory. For example:
$ ls /tmp/quote/historical_prices/
md-1-big-CompressionInfo.db md-1-big-Data.db md-1-big-Digest.crc32 md-1-big-Filter.db md-1-big-Index.db md-1-big-Statistics.db md-1-big-Summary.db md-1-big-TOC.txt
In the example above, the keyspace is `quote` and the table is `historical_prices`.
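The layout above can also be prepared programmatically. The following is a minimal sketch, assuming the SSTable files currently sit in a single flat source directory. The class `SSTableStaging` and its `stage` method are hypothetical names introduced here, and only the Java standard library is used.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper: copies SSTable files into <baseDir>/<keyspace>/<table>/.
final class SSTableStaging {
    private SSTableStaging() {}

    /**
     * Copies every regular file from sourceDir into baseDir/keyspace/table,
     * creating the directories if needed, and returns the table directory.
     */
    static Path stage(Path sourceDir, Path baseDir, String keyspace, String table)
            throws IOException {
        Path target = baseDir.resolve(keyspace).resolve(table);
        Files.createDirectories(target);
        try (DirectoryStream<Path> files = Files.newDirectoryStream(sourceDir)) {
            for (Path file : files) {
                // Skip subdirectories; only the SSTable component files are copied.
                if (Files.isRegularFile(file)) {
                    Files.copy(file, target.resolve(file.getFileName()),
                               StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
        return target;
    }
}
```

For the example in this section, calling `stage` with a base directory of `/tmp`, keyspace `quote`, and table `historical_prices` would produce `/tmp/quote/historical_prices/`. The two trailing directory names matter because `sstableloader` takes the target keyspace and table from them.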
Import the data
Run the `sstableloader` command, which is located in the `bin` folder of the Cassandra distribution package. Use the `-d` option to specify the address of a cluster node to connect to, and pass the `data/${keyspace}/${table}` data folder as the argument.
$ ${cassandra_home}/bin/sstableloader -d <ip address of the node> data/${keyspace}/${table}
Wait for the SSTable data import to complete. Then, use `cqlsh` to verify the data.
$ bin/cqlsh
cqlsh> select * from quote.historical_prices;
 ticker | date                            | adj_close | close     | high      | low       | open      | volume
--------+---------------------------------+-----------+-----------+-----------+-----------+-----------+--------
   ORCL | 2019-10-29 16:00:00.000000+0000 | 26.160000 | 26.160000 | 26.809999 | 25.629999 | 26.600000 | 181000
   ORCL | 2019-10-28 16:00:00.000000+0000 | 26.559999 | 26.559999 | 26.700001 | 22.600000 | 22.900000 | 555000
3.2 CSV data format
Data in CSV format must be converted to the SSTable format before it can be imported. Cassandra provides the `CQLSSTableWriter` class, which generates SSTables from data in any source format. Because each input record must first be parsed and mapped to the columns of the target table, you must write, compile, and run a custom program that reads the CSV file. The following code is an example of how to use the class. For the complete tool, see the git repo.
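// This fragment assumes that outputDir, SCHEMA (the CREATE TABLE statement),
// INSERT_STMT (an INSERT with one bind marker per column), ticker, DATE_FORMAT,
// csvReader, and line (the list of column strings for one record) are
// declared elsewhere in the program.

// Prepare SSTable writer
CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder();
// set output directory
builder.inDirectory(outputDir)
       // set target schema
       .forTable(SCHEMA)
       // set CQL statement to put data
       .using(INSERT_STMT)
       // set partitioner if needed
       // the default is Murmur3Partitioner, so set this only if you use a different one.
       .withPartitioner(new Murmur3Partitioner());
CQLSSTableWriter writer = builder.build();

// TODO: Read the CSV file and iterate through each line
while ((line = csvReader.read()) != null)
{
    // The arguments must follow the order of the bind markers in INSERT_STMT,
    // which is why CSV column 6 is passed before column 5 here.
    writer.addRow(ticker,
                  DATE_FORMAT.parse(line.get(0)),
                  new BigDecimal(line.get(1)),
                  new BigDecimal(line.get(2)),
                  new BigDecimal(line.get(3)),
                  new BigDecimal(line.get(4)),
                  Long.parseLong(line.get(6)),
                  new BigDecimal(line.get(5)));
}
// Flush any buffered rows and finish writing the SSTable files.
writer.close();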
After your custom program generates the data in SSTable format, import the data as described in section 3.1.
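The example in section 3.2 leaves `SCHEMA`, `INSERT_STMT`, and the CSV parsing undefined. The sketch below shows one way they might look. It uses only the Java standard library, so it can be tried without Cassandra on the classpath. Everything in it is an assumption rather than the complete tool's actual code: the class name `CsvRowSketch`, the column types (inferred from the Java types passed to `addRow`), the clustering order (inferred from the query output in section 3.1), the CSV column layout `Date,Open,High,Low,Close,Adj Close,Volume`, the `yyyy-MM-dd` date format, and the UTC time zone.

```java
import java.math.BigDecimal;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.Date;

// Hypothetical sketch of the constants and row conversion used with CQLSSTableWriter.
final class CsvRowSketch {
    private CsvRowSketch() {}

    // Keyspace and table names taken from the query in section 3.1.
    static final String KEYSPACE = "quote";
    static final String TABLE = "historical_prices";

    // Assumed table definition; adjust it to match the schema created in step 2.
    static final String SCHEMA = String.format(
        "CREATE TABLE %s.%s ("
            + "ticker ascii, date timestamp, open decimal, high decimal, "
            + "low decimal, close decimal, volume bigint, adj_close decimal, "
            + "PRIMARY KEY (ticker, date)"
            + ") WITH CLUSTERING ORDER BY (date DESC)",
        KEYSPACE, TABLE);

    // One bind marker per column; toRow must produce values in this order.
    static final String INSERT_STMT = String.format(
        "INSERT INTO %s.%s (ticker, date, open, high, low, close, volume, adj_close) "
            + "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        KEYSPACE, TABLE);

    /** Converts one CSV line into values ordered like the bind markers in INSERT_STMT. */
    static Object[] toRow(String ticker, String csvLine) {
        String[] f = csvLine.split(",", -1);
        if (f.length != 7) {
            throw new IllegalArgumentException("expected 7 columns, got " + f.length);
        }
        // Assumed date format: ISO yyyy-MM-dd, interpreted as midnight UTC.
        LocalDate day = LocalDate.parse(f[0].trim());
        return new Object[] {
            ticker,
            Date.from(day.atStartOfDay(ZoneOffset.UTC).toInstant()), // date
            new BigDecimal(f[1].trim()),  // open
            new BigDecimal(f[2].trim()),  // high
            new BigDecimal(f[3].trim()),  // low
            new BigDecimal(f[4].trim()),  // close
            Long.valueOf(f[6].trim()),    // volume (last CSV column)
            new BigDecimal(f[5].trim())   // adj_close
        };
    }
}
```

With a real `CQLSSTableWriter`, the array returned by `toRow` could be passed as `writer.addRow(row)`, because `addRow` accepts the column values as varargs. Keeping the conversion in its own method also makes it easy to unit test the column mapping before generating any SSTables.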