This has been driving me nuts all weekend. I am trying to merge data for different assets around a common timestamp. Each asset's data is a value in a dictionary of DataFrames. The data of interest is stored in lists inside each column, so these need to be separated first. Here is a sample of the unprocessed df:
data['Dogecoin'].head()
market_cap_by_available_supply price_btc price_usd volume_usd
0 [1387118554000, 3488670] [1387118554000, 6.58771e-07] [1387118554000, 0.000558776] [1387118554000, 0.0]
1 [1387243928000, 1619159] [1387243928000, 3.18752e-07] [1387243928000, 0.000218176] [1387243928000, 0.0]
2 [1387336027000, 2191987] [1387336027000, 4.10802e-07] [1387336027000, 0.000267749] [1387336027000, 0.0]
Then I apply this function to separate market_cap_by_available_supply, saving its components into a new dataframe for that asset:
import pandas as pd

data2 = {}
# sorting function
for coin in data:
    # separate timestamp and market cap from the list inside each element
    TS = data[coin].market_cap_by_available_supply.map(lambda r: r[0])
    cap = data[coin].market_cap_by_available_supply.map(lambda r: r[1])
    # create a DataFrame and store the timestamp and market cap data in the dictionary
    df = pd.DataFrame(columns=['timestamp', 'cap'])
    df.timestamp = TS
    df.cap = cap
    df.columns = ['timestamp', str(coin) + '_cap']
    data2[coin] = df
    # convert timestamp into datetime 'yyyy-mm-dd'
    data2[coin]['timestamp'] = pd.to_datetime(data2[coin]['timestamp'], unit='ms').dt.date
It seemed to work perfectly, producing the correct data. Sample:
data2['Namecoin'].head()
timestamp Namecoin_cap
0 2013-04-28 5969081
1 2013-04-29 7006114
2 2013-04-30 7049003
3 2013-05-01 6366350
4 2013-05-02 5848626
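For comparison, the same expansion can also be done in a single constructor call per coin. This is just an untested sketch of an equivalent approach, assuming every cell holds a [timestamp, value] pair as in the sample above; it is not what I originally ran:

import pandas as pd

data2 = {}
for coin in data:
    # expand the [timestamp, cap] pairs into two columns in one pass
    df = pd.DataFrame(
        data[coin]['market_cap_by_available_supply'].tolist(),
        columns=['timestamp', str(coin) + '_cap'],
    )
    df['timestamp'] = pd.to_datetime(df['timestamp'], unit='ms').dt.date
    data2[coin] = df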
However, when I attempted to merge all the dataframes I got a memory error. I've spent hours trying to find the root cause, and it seems like the 'sorting function' above increases the size of the data from 12 MB to 131 MB! It should do the opposite. Any ideas?
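For context, the merge I'm attempting is along these lines (a sketch of the pattern, pairwise outer merges on timestamp via functools.reduce; my exact call may differ):

import functools
import pandas as pd

# pairwise outer-merge every per-coin frame on the shared timestamp column
merged = functools.reduce(
    lambda left, right: pd.merge(left, right, on='timestamp', how='outer'),
    data2.values(),
)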
On a side note, here is the data:
https://www.mediafire.com/?9pcwroe1x35nnwl
I open it with this pickle function:
import pickle

with open("CMC_no_tuple_data.pickle", "rb") as myFile:
    data = pickle.load(myFile)
EDIT: Sorry for the typo in the pickle file name.
@Goyo: to compute the size I simply saved data and data2 via pickle.dump and looked at their respective file sizes.
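An alternative way to check the footprint directly in memory, rather than via pickled file sizes, would be something like this (a sketch; memory_usage(deep=True) accounts for the Python objects inside object-dtype columns):

# total in-memory size of all per-coin frames, counting object-dtype contents
total_bytes = sum(df.memory_usage(deep=True).sum() for df in data2.values())
print(total_bytes / 1e6, "MB")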
@Padraic Cunningham: you used the sorting function I provided and it produced a smaller file? That is not what happens for me, and I get a memory error when trying to merge the dataframes.