Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

Categories

0 votes
300 views
in Technique[技术] by (71.8m points)

mysql - Cosmos DB : Faster Search Options

We have huge cosmosDB container with billions of rows and almost 300 columns. Data is partitioned and modeled in a way we query it most of the time.

For example : User table is partitioned by userId thats why below query works fine.

Select * from User where userId = "user01234"

But in some cases, we need to query data differently that need sorting and then query.

For example : Get data from User Table using userpost and date of post

Select * from user where userPostId = "P01234" orderBy date limit 100

This query takes lot of time because of the size of data and data is not partitioned based on query2 (user Post).

My question is - How can we make query2 and other similar queries faster when data is not partitioned accordingly.

Option 1: "Create separate collection which is partitioned as per Query2" - This will make query faster but for any new query we will end up creating a new collection, which is duplication of billions of records. [Costly Option]

Option 2: "Build elastic search on top of DB?" This is time consuming option and may be over killing for this slow query problem.

Is there any other option that can be used? Let me know your thoughts.

Thanks in advance!


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome To Ask or Share your Answers For Others

1 Answer

0 votes
by (71.8m points)

Both options are expensive. The key is deciding which is cheaper, including running the cross-partition query. This will require you costing each of these options out.

For the cross-partition query, capture the RU charge in the response object so you know the cost of it.

For change feed, this will have an upfront cost as you run it over your existing collection, but whether that cost remains high depends on how much data is inserted or updated each month. Calculating the cost to populate your second collection will take some work. You can start by measuring the RU Charge in the response object when doing an insert then multiply by the number of rows. Calculating how much throughput you'll need will be a function of how quickly you want to populate your second collection. It's also a function of how much compute and how many instances you use to read and write the data to the second collection.

Once the second collection is populated, Change Feed will cost 2 RU/s to poll for changes (btw, this is configurable) and 1 RU/s to read each new item. The cost of inserting data into a second collection costs whatever it is when you measured it earlier.

If this second query doesn't get run that often and your data doesn't change that much, then change feed could save you money. If you run this query a lot and your data changes frequently too, change feed could still save you money.

With regards to Elastic Search or Azure Search, I generally find this can be more expensive than keeping the cross-partition query or change feed. Especially if you're doing it to just answer a second query. Generally this is a better option when you need true free text query capabilities.

A third option you might explore is using Azure Synapse Link and then run both queries using SQL Serverless or Spark.

Some other observations.

Unless you need all 300 properties in these queries you run, you may want to consider shredding these items into separate documents and storing as separate rows. Especially if you have highly asymmetric update patterns where only a small number of properties get frequently updated. This will save you a ton of money on updates because the smaller the item you update, the cheaper (and faster) it will be.

The other thing I would suggest is to look at your index policy and exclude every property that is not used in the where clause for your queries and include properties that are. This will have a dramatic impact on RU consumption for inserts. Also take a look at composite index for your date property as this has a dramatic impact on queries that use order by.


与恶龙缠斗过久,自身亦成为恶龙;凝视深渊过久,深渊将回以凝视…
Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Click Here to Ask a Question

...