Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others

python - String comparison on elements in large arrays (> 11 million)

I'm trying to compare different environments by their ids (strings). The largest array has 11.5 million rows.

What I want to do is combine all ids from the different arrays and then report which environments each id appears in.

What I'm trying to create:

id (all ids from all arrays/DataFrames)   prod    preprod   test    dev
first string                              True    True      True    False
second string                             False   True      False   True
21gdfwg23fge                              True    True      False   False
adf23 dfg a2-5                            True    False     True    True
Question from: https://stackoverflow.com/questions/66061908/string-comparison-on-elements-in-large-arrays-11-mill

1 Answer

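You can use pandas concatenation with index matching: build one single-column DataFrame per environment, indexed by that environment's unique ids, and concatenate them along the columns. It's very fast.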

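import numpy as np
import pandas as pd

# Example ids per environment
prod = "aa|b|c|d|e".split('|')
preprod = "c|d|e".split('|')
test = "b|d|e|f".split('|')
dev = "aa|e|g".split('|')

# One single-column frame per environment, indexed by its unique ids.
# Concatenating along axis=1 aligns the frames on that index;
# ids missing from an environment become NaN, which fillna(0) turns into 0.
df = pd.concat([
    pd.DataFrame({'prod': 1}, index=np.unique(prod)),
    pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
    pd.DataFrame({'test': 1}, index=np.unique(test)),
    pd.DataFrame({'dev': 1}, index=np.unique(dev))
], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})
print(df)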

>>> 
   id  prod  preprod  test  dev
0  aa   1.0      0.0   0.0  1.0
1   b   1.0      0.0   1.0  0.0
2   c   1.0      1.0   0.0  0.0
3   d   1.0      1.0   1.0  0.0
4   e   1.0      1.0   1.0  1.0
5   f   0.0      0.0   1.0  0.0
6   g   0.0      0.0   0.0  1.0
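
The output above uses 1.0/0.0 flags, while the question asks for True/False. A minimal sketch of one way to get boolean columns with the same concatenation idea is shown here. The helper name `membership_table` and the `envs` dictionary are made up for illustration and are not part of the original answer.

```python
import numpy as np
import pandas as pd


def membership_table(envs):
    """Return one row per id with a boolean column per environment.

    `envs` maps an environment name to an iterable of id strings
    (hypothetical input layout, chosen for this sketch).
    """
    # One boolean Series per environment, indexed by its unique ids.
    columns = [
        pd.Series(True, index=np.unique(ids), name=name)
        for name, ids in envs.items()
    ]
    # Aligning on the index leaves NaN where an id is absent,
    # so notna() yields True for "present" and False for "absent".
    table = pd.concat(columns, axis=1, sort=False).notna()
    return table.rename_axis('id').reset_index()


envs = {
    'prod': "aa|b|c|d|e".split('|'),
    'preprod': "c|d|e".split('|'),
    'test': "b|d|e|f".split('|'),
    'dev': "aa|e|g".split('|'),
}
table = membership_table(envs)
print(table)
```

Using `notna()` instead of `fillna(0)` skips the float columns entirely, which may also save some memory on arrays of this size.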

As for speed, here is a test on random string ids, with 10 million rows in the largest array:

# Random integer ids converted to strings; prod has 10 million rows.
prod = np.random.randint(10000000, size=10000000).astype(str)
preprod = np.random.randint(10000000, size=1000000).astype(str)
test = np.random.randint(10000000, size=1000000).astype(str)
dev = np.random.randint(10000000, size=100000).astype(str)


%%time
# %%time is an IPython/Jupyter cell magic and must be the first line of its cell.
df = pd.concat([
    pd.DataFrame({'prod': 1}, index=np.unique(prod)),
    pd.DataFrame({'preprod': 1}, index=np.unique(preprod)),
    pd.DataFrame({'test': 1}, index=np.unique(test)),
    pd.DataFrame({'dev': 1}, index=np.unique(dev))
], axis=1, sort=False).fillna(0).reset_index().rename(columns={'index': 'id'})

>>> Wall time: 32.3 s

on my humble laptop.
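
Since `%%time` only works in IPython or Jupyter, here is a rough sketch of how the same measurement could be taken in a plain Python script with `time.perf_counter`. The array sizes are scaled down 100x from the benchmark above so that it runs quickly (an assumption for illustration, as are the `sizes` and `envs` names); timings will differ from machine to machine.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Assumed sizes: 100x smaller than the benchmark above, for a quick run.
sizes = {'prod': 100_000, 'preprod': 10_000, 'test': 10_000, 'dev': 1_000}
# Random integer ids in [0, 100000), converted to strings.
envs = {name: rng.integers(100_000, size=n).astype(str)
        for name, n in sizes.items()}

start = time.perf_counter()
df = pd.concat(
    [pd.DataFrame({name: 1}, index=np.unique(ids)) for name, ids in envs.items()],
    axis=1, sort=False,
).fillna(0).reset_index().rename(columns={'index': 'id'})
elapsed = time.perf_counter() - start

print(f"{len(df)} distinct ids combined in {elapsed:.2f} s")
```

To approximate the original test, the sizes can be raised back to 10 million, 1 million, 1 million and 100 thousand, memory permitting.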

