It seems that python would generate the list of arguments first, and then feed the list to the function "f", even using xrange. Is that correct?
Yes, because you're using a list comprehension, which explicitly asks it to generate that list.
(Note that xrange
isn't really relevant here, because you only have two ranges at a time, each 10K long; compared to the 100M of the argument list, that's nothing.)
If you want it to generate the values on the fly as needed, instead of all 100M at once, you want to use a generator expression instead of a list comprehension. Which is almost always just a matter of turning the brackets into parentheses:
x=pool.map(f,((i,j) for i in range(10000) for j in range(10000)))
However, as you can see from the source, map
will ultimately just make a list if you give it a generator, so in this case, that won't solve anything. (The docs don't explicitly say this, but it's hard to see how it could pick a good chunksize to chop the iterable into if it didn't have a length…).
And, even if that weren't true, you'd still just run into the same problem again with the results, because pool.map
returns a list.
To solve both problems, you can use pool.imap
instead. It consumes the iterable lazily, and returns a lazy iterator of results.
One thing to note is that imap
does not guess at the best chunksize if you don't pass one, but just defaults to 1
, so you may need a bit of thought or trial&error to optimize it.
Also, imap
will still queue up some results as they come in, so it can feed them back to you in the same order as the arguments. In pathological cases, it could end up queuing up (poolsize-1)/poolsize of your results, although in practice this is incredibly rare. If you want to solve this, use imap_unordered
. If you need to know the ordering, just pass the indexes back and forth with the args and results:
args = ((i, j) for i in range(10000) for j in range(10000))
def indexed_f(index, (i, j)):
return index, f(i, j)
results = pool.imap_unordered(indexed_f, enumerate(args))
However, I notice that in your original code, you're not doing anything at all with the results of f(i, j)
. In that case, why even bother gathering up the results at all? In that case, you can just go back to the loop:
for i in range(10000):
for j in range(10000):
map.apply_async(f, (i,j))
However, imap_unordered
may still be worth using, because it provides a very easy way to block until all of the tasks are done, while still leaving the pool itself running for later use:
def consume(iterator):
deque(iterator, max_len=0)
x=pool.imap_unordered(f,((i,j) for i in range(10000) for j in range(10000)))
consume(x)