Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


YouTube API v3 page tokens

I'm using the Search API and the nextPageToken to paginate through the results, but I'm not able to retrieve all the results this way. I only get 500 results out of approximately 455,000.

Here's the Java code that fetches the search results:

// Build the client; the no-op initializer is enough when authenticating with an API key.
youtube = new YouTube.Builder(Auth.HTTP_TRANSPORT, Auth.JSON_FACTORY, new HttpRequestInitializer() {
    public void initialize(HttpRequest request) throws IOException {}
}).setApplicationName("youtube-search").build();

YouTube.Search.List search = youtube.search().list("id,snippet");
String apiKey = properties.getProperty("youtube.apikey");
search.setKey(apiKey);
search.setType("video");
search.setMaxResults(50L); // setMaxResults takes a Long, so use a long literal
search.setQ(queryTerm);

boolean allResultsRead = false;
while (!allResultsRead) {
    SearchListResponse searchResponse = search.execute();
    System.out.println("Printed " + searchResponse.getPageInfo().getResultsPerPage()
            + " out of " + searchResponse.getPageInfo().getTotalResults()
            + ". Current page token: " + search.getPageToken()
            + " Next page token: " + searchResponse.getNextPageToken()
            + ". Prev page token " + searchResponse.getPrevPageToken());
    if (searchResponse.getNextPageToken() == null) {
        // Last page reached: stop looping and rebuild the request object.
        allResultsRead = true;
        search = youtube.search().list("id,snippet");
        search.setKey(apiKey);
        search.setType("video");
        search.setMaxResults(50L);
    } else {
        // Ask for the following page on the next iteration.
        search.setPageToken(searchResponse.getNextPageToken());
    }
}

The output is:

Printed 50 out of 455085. Current page token: null Next page token: CDIQAA. Prev page token null
Printed 50 out of 454983. Current page token: CDIQAA Next page token: CGQQAA. Prev page token CDIQAQ
Printed 50 out of 455081. Current page token: CGQQAA Next page token: CJYBEAA. Prev page token CGQQAQ
Printed 50 out of 454981. Current page token: CJYBEAA Next page token: CMgBEAA. Prev page token CJYBEAE
Printed 50 out of 455081. Current page token: CMgBEAA Next page token: CPoBEAA. Prev page token CMgBEAE
Printed 50 out of 454981. Current page token: CPoBEAA Next page token: CKwCEAA. Prev page token CPoBEAE
Printed 50 out of 455081. Current page token: CKwCEAA Next page token: CN4CEAA. Prev page token CKwCEAE
Printed 50 out of 454980. Current page token: CN4CEAA Next page token: CJADEAA. Prev page token CN4CEAE
Printed 50 out of 455081. Current page token: CJADEAA Next page token: CMIDEAA. Prev page token CJADEAE
Printed 50 out of 455081. Current page token: CMIDEAA Next page token: null. Prev page token CMIDEAE

After 10 iterations through the while loop, it exits because the next page token is null.

I'm new to the YouTube API and not sure what I'm doing wrong here. I have two questions: 1. How do I get all the results? 2. Why is the previous page token for page 3 not the same as the current token of page 2?

Any help will be appreciated. Thanks!


Fight the dragon too long and you become the dragon yourself; gaze into the abyss too long and the abyss gazes back…

1 Answer


What you're seeing is the intended behavior: using the nextPageToken, you can only get up to 500 results. If you're interested in how this came about, you can read through this thread:

https://code.google.com/p/gdata-issues/issues/detail?id=4282

To summarize that thread: it basically comes down to the fact that, with so much data on YouTube, the search algorithms are radically different from what most people think they are. This isn't just simple database searching for content in fields; an incredible number of signals are processed to make the results relevant, and after about 500 results the algorithms start to lose the ability to make the results worthwhile.

One thing that has helped me wrap my mind around this is to realize that when YouTube talks about search, they are talking about probability rather than matching, so the results are ordered, based on your parameters, by their likelihood of being relevant to your query. As you paginate through, you eventually reach a point where, statistically speaking, the probability of relevance is low enough that it isn't computationally worth it to allow those results to come back. So 500 is the decided-upon limit.
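To make the cutoff concrete, here is a minimal sketch of a pagination loop that simply drains pages until the source stops handing out a next token. It does not call the real API: `Page`, `collectAll`, and `fakeSearch` are hypothetical names, and `fakeSearch` is a stand-in that imitates the behavior described above (50 items per page, no next token once 500 items have been served).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch: drain a token-paginated source until it stops handing out tokens. */
class PagedSearchSketch {

    /** One page of results plus the token for the following page (null on the last page). */
    static final class Page {
        final List<String> items;
        final String nextPageToken;

        Page(List<String> items, String nextPageToken) {
            this.items = items;
            this.nextPageToken = nextPageToken;
        }
    }

    /**
     * Calls fetchPage with the current token (null for the first page) and
     * collects items until the source returns a null nextPageToken.
     */
    static List<String> collectAll(Function<String, Page> fetchPage) {
        List<String> all = new ArrayList<>();
        String token = null;
        do {
            Page page = fetchPage.apply(token);
            all.addAll(page.items);
            token = page.nextPageToken;
        } while (token != null);
        return all;
    }

    /** Hypothetical stand-in for the API: 50 items per page, no token after 500 items. */
    static Page fakeSearch(String token) {
        int offset = (token == null) ? 0 : Integer.parseInt(token);
        List<String> items = new ArrayList<>();
        for (int i = offset; i < offset + 50; i++) {
            items.add("video-" + i);
        }
        int next = offset + 50;
        return new Page(items, next < 500 ? String.valueOf(next) : null);
    }

    public static void main(String[] args) {
        List<String> results = collectAll(PagedSearchSketch::fakeSearch);
        System.out.println("Collected " + results.size() + " results");
    }
}
```

However large the reported total is, the loop ends as soon as the next token comes back null, which is exactly what happens after the tenth page in the question.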

(Also note that the number of "results" isn't an approximation of actual matches; it's an approximation of potential matches. As you start to retrieve them, many of those possible matches get cast aside as not being relevant at all, so that number doesn't really mean what people think it does. You can see this in your own output, where the reported total moves between 454980 and 455085 from page to page. Google search is the same way.)

You might wonder why YouTube search functions this way rather than doing more traditional string/data matching. With so much search volume, if they actually did a complete search of all the data for every query, you'd be waiting minutes at a time, if not more. It's really a technical marvel, if you think about it, that the algorithms get such relevant results for the top 500 cases when they're working from prediction and probability.

As to your second question: the page tokens don't represent a unique set of results. They represent a sort of algorithmic state, and are thus pointers to your query, the progress of the query, and the direction of the query. Iteration 3, for example, is referenced by both the nextPageToken of iteration 2 and the prevPageToken of iteration 4, but those two tokens are slightly different so that they can indicate the direction they came from.
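You can see the "direction" part in the tokens from the question. The token format is undocumented and may change at any time, so treat the following as an observation about those particular strings rather than a description of the API: decoded as URL-safe Base64, they look like two small tagged varints, which this sketch assumes to be an offset and a direction flag. The class and method names are hypothetical.

```java
import java.util.Base64;

/** Sketch: decode the page tokens from the question under an ASSUMED layout. */
class PageTokenSketch {

    /** Reads one base-128 varint starting at pos[0] and advances pos[0] past it. */
    static long readVarint(byte[] bytes, int[] pos) {
        long value = 0;
        int shift = 0;
        while (true) {
            int b = bytes[pos[0]++] & 0xFF;
            value |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
            shift += 7;
        }
    }

    /**
     * Assumed layout (not documented anywhere official):
     * byte 0x08, varint offset, byte 0x10, varint direction flag.
     * Returns {offset, flag}.
     */
    static long[] decode(String token) {
        // Java's decoder accepts input without '=' padding.
        byte[] bytes = Base64.getUrlDecoder().decode(token);
        int[] pos = {0};
        if (bytes[pos[0]++] != 0x08) {
            throw new IllegalArgumentException("unexpected layout: " + token);
        }
        long offset = readVarint(bytes, pos);
        if (bytes[pos[0]++] != 0x10) {
            throw new IllegalArgumentException("unexpected layout: " + token);
        }
        long flag = readVarint(bytes, pos);
        return new long[] {offset, flag};
    }

    public static void main(String[] args) {
        // Next/prev token pairs copied from the output in the question.
        String[] tokens = {"CDIQAA", "CDIQAQ", "CGQQAA", "CGQQAQ", "CMIDEAA", "CMIDEAE"};
        for (String token : tokens) {
            long[] fields = decode(token);
            System.out.println(token + " -> offset " + fields[0] + ", flag " + fields[1]);
        }
    }
}
```

Under that assumption, CDIQAA and CDIQAQ both carry offset 50 and differ only in the flag (0 for a next token, 1 for a prev token), which fits the explanation above: the same position, approached from opposite directions. The last next token in the question, CMIDEAA, carries offset 450, the start of the tenth and final page.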

