python - String comparison on elements in large arrays (> 11 mill)

asked Oct 6, 2021 in Technique by 深蓝 (71.8m points)

I'm trying to compare different environments by their ids (strings). The largest array has 11.5 million rows.

I want to combine all ids from the different arrays and then report which environments each id appears in.

What I'm trying to create:

id (all ids from all arrays/DataFrames)	prod	preprod	test	dev
first string	True	True	True	False
second string	False	True	False	True
21gdfwg23fge	True	True	False	False
adf23 dfg a2-5	True	False	True	True

question from:https://stackoverflow.com/questions/66061908/string-comparison-on-elements-in-large-arrays-11-mill

1 Answer

深蓝 · Answer 1 · 2021-10-06T03:05:39+0000

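You can use pandas concatenation with index matching: build one single-column DataFrame per environment, indexed by its unique ids, and let pd.concat align them on the index. It's very fast.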

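import numpy as np
import pandas as pd

# Small sample id lists, one per environment
prod = "aa|b|c|d|e".split('|')
preprod = "c|d|e".split('|')
test = "b|d|e|f".split('|')
dev = "aa|e|g".split('|')

# One single-column frame per environment, indexed by its unique ids;
# concatenating along axis=1 aligns the frames on that index
df = pd.concat([
    pd.DataFrame({'prod': 1}, index=np.unique(prod)),
    pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
    pd.DataFrame({'test': 1}, index=np.unique(test)),
    pd.DataFrame({'dev': 1}, index=np.unique(dev))
], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})
# fillna(0) marks the ids that are absent from an environment
print(df)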

>>> 
   id  prod  preprod  test  dev
0  aa   1.0      0.0   0.0  1.0
1   b   1.0      0.0   1.0  0.0
2   c   1.0      1.0   0.0  0.0
3   d   1.0      1.0   1.0  0.0
4   e   1.0      1.0   1.0  1.0
5   f   0.0      0.0   1.0  0.0
6   g   0.0      0.0   0.0  1.0
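
The question asks for True/False flags rather than 1.0/0.0. Below is a minimal sketch of the same concat-on-index idea, wrapped in a helper that returns boolean columns. The function name `membership_table` and the dict-of-lists input are assumptions made for illustration; they are not part of the original answer.

```python
import numpy as np
import pandas as pd


def membership_table(envs):
    """Return one row per unique id and one boolean column per environment.

    `envs` is a hypothetical dict mapping environment name -> iterable of ids.
    """
    frames = [
        # One single-column frame per environment, indexed by its unique ids
        pd.DataFrame({name: True}, index=np.unique(np.asarray(ids, dtype=str)))
        for name, ids in envs.items()
    ]
    # Aligning on the index leaves NaN where an id is missing from an environment
    table = pd.concat(frames, axis=1, sort=False)
    # notna() turns "present" into True and "missing" into False
    table = table.notna()
    return table.rename_axis("id").reset_index()


if __name__ == "__main__":
    envs = {
        "prod": "aa|b|c|d|e".split("|"),
        "preprod": "c|d|e".split("|"),
        "test": "b|d|e|f".split("|"),
        "dev": "aa|e|g".split("|"),
    }
    print(membership_table(envs))
```

Using `notna()` instead of `fillna(0)` keeps every environment column as a plain boolean, which matches the table layout requested in the question.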

As for speed, here is a test on larger random inputs:

prod = np.random.randint(10000000, size=10000000).astype(str)
preprod = np.random.randint(10000000, size=1000000).astype(str)
test = np.random.randint(10000000, size=1000000).astype(str)
dev = np.random.randint(10000000, size=100000).astype(str)


%%time
df = pd.concat([
    pd.DataFrame({'prod': 1}, index=np.unique(prod)),
    pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
    pd.DataFrame({'test': 1}, index=np.unique(test)),
    pd.DataFrame({'dev': 1}, index=np.unique(dev))
], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})

>>> Wall time: 32.3 s

That is on my humble laptop.
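
Note that `%%time` is an IPython/Jupyter cell magic and will not run in a plain Python script. Here is a rough sketch of the same measurement using `time.perf_counter`. The array sizes are scaled down from the run above and the seed is arbitrary; both are assumptions chosen so that the script finishes quickly, so it will not reproduce the 32.3 s figure.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability

# Scaled-down sizes (assumption); the run above used up to 10,000,000 ids
sizes = {"prod": 100_000, "preprod": 10_000, "test": 10_000, "dev": 1_000}
envs = {
    name: rng.integers(100_000, size=n).astype(str)
    for name, n in sizes.items()
}

start = time.perf_counter()
df = pd.concat(
    [pd.DataFrame({name: 1}, index=np.unique(ids)) for name, ids in envs.items()],
    axis=1,
    sort=False,
).fillna(0).reset_index().rename(columns={"index": "id"})
elapsed = time.perf_counter() - start

print(f"{len(df)} unique ids processed in {elapsed:.2f} s")
```

Raising the values in `sizes` toward the original figures should give a timing comparable to the one quoted, depending on the machine.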
