Welcome to OStack Knowledge Sharing Community for programmer and developer-Open, Learning and Share
Welcome To Ask or Share your Answers For Others


python 3.x - How to make Pandas dataframe run on multiple cores?

I have around 5 million rows being processed in a pandas DataFrame, and it takes about 45 minutes. There are multiple columns, and the data transformation involves copying some columns to new columns, changing values in others according to given logic, and many similar transformations. Since pandas runs on only one core at a time, my other cores remain idle. How do I make pandas use all available cores and reduce the total transformation time?



1 Answer


modin is a package that parallelizes pandas across multiple CPU cores. Its official documentation page covers it in more detail.

The usual way to start is with ray as the backend. Install it via pip (the quotes keep shells such as zsh from interpreting the square brackets):

pip install "modin[ray]"

If you prefer dask, install it this way instead:

pip install "modin[dask]"
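If both backends are installed, you may want to pick one explicitly. A minimal sketch, assuming you rely on modin's `MODIN_ENGINE` environment variable and set it before modin is first imported (the choice of "dask" here is only for illustration):

```python
import os

# Assumption: this runs before `import modin.pandas` anywhere in the process,
# since modin reads the engine setting when it starts up.
os.environ["MODIN_ENGINE"] = "dask"  # or "ray"
```

Setting the same variable in the shell before launching Python should have the same effect.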

To use the package, just change the import; roughly 88% of the pandas functions you are likely to need are available:

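import modin.pandas as pd  # drop-in replacement for `import pandas as pd`
import numpy as np

# 1024 x 256 array of random integers in [0, 100)
frame_data = np.random.randint(0, 100, size=(2**10, 2**8))
df = pd.DataFrame(frame_data)  # a modin DataFrame instead of a pandas one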
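The kind of transformations described in the question (copying a column, changing values by a rule) can be written the same way as in plain pandas. The following is only a sketch: the column names, the cap at 50, and the `transform` helper are hypothetical, and the import falls back to pandas so the snippet still runs where modin is not installed.

```python
try:
    import modin.pandas as pd  # parallel drop-in, if installed
except ImportError:
    import pandas as pd  # fallback so the sketch still runs (single core)

import numpy as np


def transform(df):
    """Apply a few illustrative column transformations (hypothetical logic)."""
    out = df.copy()
    # Copy an existing column into a new column.
    out["price_copy"] = out["price"]
    # Change values according to a rule: cap prices at 50.
    out["price_capped"] = out["price"].clip(upper=50)
    # Derive a flag column with a vectorized comparison.
    out["is_expensive"] = out["price"] > 50
    return out


rng = np.random.default_rng(0)
df = pd.DataFrame({"price": rng.integers(0, 100, size=1_000)})
result = transform(df)
```

Vectorized column operations like these tend to benefit most from a parallel backend; row-by-row Python loops may see little improvement, so it is worth timing your own workload.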
