BulkLoad data import

Preparations

BulkLoad streams data files directly into a Cassandra cluster. It is one of the fastest ways to migrate offline data to an online Cassandra cluster. The following preparations are required:

An online Cassandra cluster.
Offline data in SSTable or CSV format.
A separate ECS instance in the same VPC. Configure the security group to allow access to the Cassandra cluster.

1. Prepare a client ECS instance in the same VPC

Use a separate ECS instance instead of an instance from the online Cassandra cluster. This prevents the import process from affecting the online service.

2. Create a schema

$ cqlsh -f schema.cql -u USERNAME -p PASSWORD [host]

3. Prepare the data

3.1 SSTable data format

Place the SSTable files in a directory whose path ends in `${keyspace}/${table}`, such as `data/${keyspace}/${table}`. `sstableloader` takes the target keyspace and table from these two directory names. For example:

$ ls /tmp/quote/historical_prices/
md-1-big-CompressionInfo.db md-1-big-Data.db md-1-big-Digest.crc32 md-1-big-Filter.db md-1-big-Index.db md-1-big-Statistics.db md-1-big-Summary.db md-1-big-TOC.txt

In the example above, the keyspace is `quote` and the table is `historical_prices`.
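
As a rough illustration of this layout, the sketch below derives the keyspace and table from the last two components of an SSTable directory path. `SSTableDir` is a hypothetical helper written for this example, not part of Cassandra. It only mirrors the naming convention and does not read any files.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not a Cassandra class): reads the keyspace and table
// names from a directory path of the form .../<keyspace>/<table>.
final class SSTableDir {
    final String keyspace;
    final String table;

    private SSTableDir(String keyspace, String table) {
        this.keyspace = keyspace;
        this.table = table;
    }

    static SSTableDir of(String dir) {
        Path path = Paths.get(dir).normalize();
        // At least two name elements are needed: <keyspace> and <table>.
        if (path.getNameCount() < 2) {
            throw new IllegalArgumentException("Expected .../<keyspace>/<table>, got: " + dir);
        }
        String table = path.getFileName().toString();
        String keyspace = path.getParent().getFileName().toString();
        return new SSTableDir(keyspace, table);
    }

    public static void main(String[] args) {
        SSTableDir d = SSTableDir.of("/tmp/quote/historical_prices/");
        System.out.println(d.keyspace + "." + d.table);
    }
}
```

A check like this can be useful before starting a long import, because a misnamed directory would target the wrong keyspace or table.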

Import the data

Run the `sstableloader` command, which is in the `bin` directory of the Cassandra distribution package, and pass it the `data/${keyspace}/${table}` data directory.

$ ${cassandra_home}/bin/sstableloader -d <IP address of a node> data/${keyspace}/${table}
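
If the import is driven from a program rather than typed by hand, the command line can be assembled as in the following sketch. `LoaderCommand`, the installation path, the node address, and the credentials are all placeholders chosen for illustration. The sketch assumes that a cluster requiring the user name and password from the schema step also needs them here, which `sstableloader` accepts through its `-u` and `-pw` options.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assembles an sstableloader command line.
final class LoaderCommand {
    // username and password may both be null if the cluster has no authentication.
    static List<String> build(String cassandraHome, String nodes,
                              String username, String password, String dataDir) {
        List<String> cmd = new ArrayList<>();
        cmd.add(cassandraHome + "/bin/sstableloader");
        cmd.add("-d");            // initial node(s) to connect to
        cmd.add(nodes);
        if (username != null && password != null) {
            cmd.add("-u");        // user name
            cmd.add(username);
            cmd.add("-pw");       // password
            cmd.add(password);
        }
        cmd.add(dataDir);         // directory ending in <keyspace>/<table>
        return cmd;
    }

    public static void main(String[] args) {
        // All values below are placeholders, not real endpoints or credentials.
        List<String> cmd = build("/opt/cassandra", "192.0.2.10",
                                 "USERNAME", "PASSWORD", "data/quote/historical_prices");
        System.out.println(String.join(" ", cmd));
        // To run the import, the list could be handed to ProcessBuilder:
        // new ProcessBuilder(cmd).inheritIO().start().waitFor();
    }
}
```

The sketch only prints the command. Starting the process is left commented out so that the example has no side effects.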

Wait for the SSTable data import to complete. Then, use `cqlsh` to verify the data.

$ bin/cqlsh 
cqlsh> select * from quote.historical_prices;

 ticker | date                            | adj_close | close     | high      | low       | open      | volume
--------+---------------------------------+-----------+-----------+-----------+-----------+-----------+--------
   ORCL | 2019-10-29 16:00:00.000000+0000 | 26.160000 | 26.160000 | 26.809999 | 25.629999 | 26.600000 | 181000
   ORCL | 2019-10-28 16:00:00.000000+0000 | 26.559999 | 26.559999 | 26.700001 | 22.600000 | 22.900000 | 555000

3.2 CSV data format

CSV data must be converted to the SSTable format before it can be imported. Cassandra provides the `CQLSSTableWriter` class, which generates SSTables from rows that you supply, so the source data can be in any format. Because each CSV layout is different, you must write, compile, and run a custom program that parses the CSV file and passes each row to the writer. The following code is an example of how to use the class. For the complete tool, see the git repo.

        // Prepare the SSTable writer
        CQLSSTableWriter.Builder builder = CQLSSTableWriter.builder();
        // Set the output directory
        builder.inDirectory(outputDir)
               // Set the target schema (the CREATE TABLE statement)
               .forTable(SCHEMA)
               // Set the CQL INSERT statement used to write each row
               .using(INSERT_STMT)
               // Set the partitioner only if the cluster uses a different one;
               // the default is Murmur3Partitioner
               .withPartitioner(new Murmur3Partitioner());
        CQLSSTableWriter writer = builder.build();
        // TODO: Open the CSV file and iterate through each line.
        // csvReader.read() is assumed to return one line as a List<String>, or null at end of file.
        List<String> line;
        while ((line = csvReader.read()) != null)
        {
            // Pass the values in the order of the placeholders in INSERT_STMT
            writer.addRow(ticker,
                          DATE_FORMAT.parse(line.get(0)),
                          new BigDecimal(line.get(1)),
                          new BigDecimal(line.get(2)),
                          new BigDecimal(line.get(3)),
                          new BigDecimal(line.get(4)),
                          Long.parseLong(line.get(6)),
                          new BigDecimal(line.get(5)));
        }
        // Close the writer to finish writing the SSTable files
        writer.close();

After your custom program generates the data in SSTable format, import the data as described in section 3.1.
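
The parsing step that the example leaves as a TODO could look roughly like the following sketch, which uses only the JDK. The CSV column order (`Date,Open,High,Low,Close,Adj Close,Volume`), the `yyyy-MM-dd` date pattern, and the `CsvRowParser` name are assumptions made for illustration. Adjust them to match your file and the placeholder order of your `INSERT` statement.

```java
import java.math.BigDecimal;
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

// Hypothetical helper: converts one CSV line into typed values.
// Assumed input columns: Date,Open,High,Low,Close,Adj Close,Volume
// Assumed output order:  ticker, date, open, high, low, close, volume, adj_close
final class CsvRowParser {
    static Object[] toRow(String ticker, String csvLine) {
        // Naive split: does not handle quoted fields that contain commas.
        String[] f = csvLine.split(",", -1);
        if (f.length != 7) {
            throw new IllegalArgumentException("Expected 7 columns, got " + f.length);
        }
        SimpleDateFormat fmt = new SimpleDateFormat("yyyy-MM-dd");
        fmt.setTimeZone(TimeZone.getTimeZone("UTC"));
        fmt.setLenient(false);
        Date date;
        try {
            date = fmt.parse(f[0]);
        } catch (ParseException e) {
            throw new IllegalArgumentException("Bad date: " + f[0], e);
        }
        return new Object[] {
            ticker,
            date,
            new BigDecimal(f[1]),   // open
            new BigDecimal(f[2]),   // high
            new BigDecimal(f[3]),   // low
            new BigDecimal(f[4]),   // close
            Long.parseLong(f[6]),   // volume
            new BigDecimal(f[5])    // adj_close
        };
    }

    public static void main(String[] args) {
        Object[] row = toRow("ORCL",
                "2019-10-29,26.600000,26.809999,25.629999,26.160000,26.160000,181000");
        for (Object value : row) {
            System.out.println(value.getClass().getSimpleName() + ": " + value);
        }
    }
}
```

Because `CQLSSTableWriter.addRow` accepts a variable number of `Object` arguments, an array built this way can be passed to it directly, for example `writer.addRow(CsvRowParser.toRow(ticker, csvLine))`.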