Yes, the algorithm can be improved. You are recomputing the set of users for each item inside the inner loop many times. Instead, build a dictionary that maps each item to its users once, outside the loops.
import itertools
import pandas as pd

# get unique items
items = set(main.productId)
n_users = len(set(main.userId))
# make a dictionary of item and users who bought that item
item_users = main.groupby('productId')['userId'].apply(set).to_dict()
# iterate over combinations of item1 and item2 and store scores
result = []
for item1, item2 in itertools.combinations(items, 2):
    # fraction of all users who bought both items
    score = len(item_users[item1] & item_users[item2]) / n_users
    result.append((item1, item2, score))
    result.append((item2, item1, score))  # store score for reverse order as well
# convert results to a dataframe
result = pd.DataFrame(result, columns=["item1", "item2", "score"])
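To make the score concrete, here is a minimal sketch of the same idea in plain Python, without pandas. The `purchases` list and the `pair_scores` helper are hypothetical, invented only for illustration; they stand in for the `main` dataframe from the question.

```python
import itertools
from collections import defaultdict

# Hypothetical toy data: (userId, productId) pairs, not taken from the question.
purchases = [
    ("u1", "A"), ("u1", "B"),
    ("u2", "A"), ("u2", "B"), ("u2", "C"),
    ("u3", "B"), ("u3", "C"),
    ("u4", "A"),
]


def pair_scores(purchases):
    """Return {(item1, item2): score} for every ordered pair of distinct items."""
    # Build the item -> set of users mapping once, outside the pair loop.
    item_users = defaultdict(set)
    users = set()
    for user, item in purchases:
        item_users[item].add(user)
        users.add(user)
    n_users = len(users)

    scores = {}
    for item1, item2 in itertools.combinations(sorted(item_users), 2):
        # Fraction of all users who bought both items.
        score = len(item_users[item1] & item_users[item2]) / n_users
        scores[(item1, item2)] = score
        scores[(item2, item1)] = score  # same score for the reverse order
    return scores


scores = pair_scores(purchases)
print(scores[("A", "B")])  # 2 of 4 users bought both A and B -> 0.5
```

The set intersection per pair is the only work left inside the loop, which is where the speedup comes from.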
Timing differences:
Original implementation from question
# 3 loops, best of 3: 41.8 ms per loop
Mark's Method 2
# 3 loops, best of 3: 19.9 ms per loop
Implementation in this answer
# 3 loops, best of 3: 3.01 ms per loop