Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
242 views
in Technique[技术] by (71.8m points)

cql - Cassandra: List 10 most recently modified records

I'm having trouble trying to model my data such that I can efficiently query Cassandra for the last 10 (any number actually) records that were most recently modified. Each record has a last_modified_date column that is set by the application when inserting/updating the record.

I've excluded the data columns from this example code.

Main data table (contains only one row per record):

CREATE TABLE record (
    record_id int,
    last_modified_by text,
    last_modified_date timestamp,
    PRIMARY KEY (record_id)
);

Solution 1 (Fail)

I tried to create a separate table, which used a clustering key order.

Table (one row for each record; only inserting the last modified date):

CREATE TABLE record_by_last_modified_index (
    record_id int,
    last_modified_by text,
    last_modified_date timestamp,
    PRIMARY KEY (record_id, last_modified_date)
) WITH CLUSTERING ORDER BY (last_modified_date DESC);

Query:

SELECT * FROM record_by_last_modified_index LIMIT 10

This solution does not work since the clustering order only applies to the ordering of records with the same partition key. Since each row has a different partition key (record_id) the query results don't include the expected records.

Solution 2 (Inefficient)

Another solution I have tried is to simply query Cassandra for all record_id and last_modified_date values, sort them and pick the first 10 records in my application. This is clearly inefficient and won't scale well.

Solution 3

One last solution, which I considered is using the same partition key for all records and using clustering order to ensure records are sorted correctly. The problem with that solution is that the data will not be correctly partitioned across the nodes since all of the records would have the same partition key. That seems like a non-starter to me.

See Question&Answers more detail:os

与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

I think what you're trying to do is more of a relational database model and is somewhat of an anti-pattern in Cassandra.

Cassandra only sorts things based on clustering columns, but the sort order isn't expected to change. This is because when memtables are written to disk as SSTables (Sorted String Tables), the SSTables are immutable and can't be re-sorted efficiently. This is why you aren't allowed to update the value of a clustering column.

If you want to re-sort the clustered rows, the only way I know is to delete the old row and insert a new one in a batch. To make that even more inefficient, you would probably need to first do a read to figure out what the last_modified_date was for the record_id so that you could delete it.

So I'd look for a different approach, such as just writing the updates as new clustered rows and leave the old ones there (possibly clean them up over time using a TTL). So your newest updates would always be on top when you did a LIMIT query.

In terms of partitioning, you will need to break your data into a few categories to spread the data over your nodes. That means you won't get global sorting of your table, but only within categories, which is due to the distributed model. If you really need global sorting, then perhaps look at something like pairing Cassandra with Spark. Sorting is super expensive in time and resources, so think carefully if you really need it.

Update:

Thinking about this some more, you should be able to do this in Cassandra 3.0 using materialized views. The view would take care of the messy delete and insert for you, to re-order the clustered rows. So here's what it looks like in the 3.0 alpha release:

First create the base table:

CREATE TABLE record_ids (
    record_type int,
    last_modified_date timestamp,
    record_id int,
    PRIMARY KEY(record_type, record_id));

Then create a view of that table, using last_modified_date as a clustering column:

CREATE MATERIALIZED VIEW last_modified AS
    SELECT record_type FROM record_ids
    WHERE record_type IS NOT NULL AND last_modified_date IS NOT NULL AND record_id IS NOT NULL
    PRIMARY KEY (record_type, last_modified_date, record_id)
    WITH CLUSTERING ORDER BY (last_modified_date DESC);

Now insert some records:

insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 100);
insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 200);
insert into record_ids (record_type, last_modified_date, record_id) VALUES ( 1, dateof(now()), 300);

SELECT * FROM record_ids;

 record_type | record_id | last_modified_date
-------------+-----------+--------------------------
           1 |       100 | 2015-08-14 19:41:10+0000
           1 |       200 | 2015-08-14 19:41:25+0000
           1 |       300 | 2015-08-14 19:41:41+0000

SELECT * FROM last_modified;

 record_type | last_modified_date       | record_id
-------------+--------------------------+-----------
           1 | 2015-08-14 19:41:41+0000 |       300
           1 | 2015-08-14 19:41:25+0000 |       200
           1 | 2015-08-14 19:41:10+0000 |       100

Now we update a record in the base table, and should see it move to the top of the list in the view:

UPDATE record_ids SET last_modified_date = dateof(now()) 
WHERE record_type=1 AND record_id=200;

So in the base table, we see the timestamp for record_id=200 was updated:

SELECT * FROM record_ids;

 record_type | record_id | last_modified_date
-------------+-----------+--------------------------
           1 |       100 | 2015-08-14 19:41:10+0000
           1 |       200 | 2015-08-14 19:43:13+0000
           1 |       300 | 2015-08-14 19:41:41+0000

And in the view, we see:

 SELECT * FROM last_modified;

 record_type | last_modified_date       | record_id
-------------+--------------------------+-----------
           1 | 2015-08-14 19:43:13+0000 |       200
           1 | 2015-08-14 19:41:41+0000 |       300
           1 | 2015-08-14 19:41:10+0000 |       100

So you see that record_id=200 moved up in the view and if you do a limit N on that table, you'd get the N most recently modified rows.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...