Monday, June 29, 2015

Using DataFrame to get matrix of identifiers


I am an R user who is new to Python. I have some data:

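import pandas as pd

dat1 = pd.DataFrame({
    'user_id': ['a1', 'a1', 'a4', 'a3', 'a1', 'a15', 'a8', 'a15', 'a1', 'a5'],
    'Visits': [1, 4, 2, 1, 3, 1, 1, 8, 1, 9],
    'cell': [14, 21, 14, 14, 19, 10, 18, 17, 10, 11],
    'date': ['2011-01-05', '2011-01-05', '2011-01-12', '2011-01-12', '2011-01-12',
             '2011-01-12', '2011-01-02', '2011-01-19', '2011-01-19', '2011-01-19'],
})

# Parse the date strings into datetime64 values.
dat1['date'] = pd.to_datetime(dat1['date'])

# sort_values replaces the older sort_index(by=...); a stable sort keeps
# rows that share a date in their original order.
dat2 = dat1.sort_values('date', kind='mergesort')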

This gives me a DataFrame of the form

Visits  cell     date     user_id
   1    18   2011-01-02      a8
   1    14   2011-01-05      a1
   4    21   2011-01-05      a1
   2    14   2011-01-12      a4
   1    14   2011-01-12      a3
   3    19   2011-01-12      a1
   1    10   2011-01-12     a15
   8    17   2011-01-19     a15
   1    10   2011-01-19      a1
   9    11   2011-01-19      a5

I want to create a DataFrame such that each column is identified with a unique user_id and each row is a unique date. Each cell contains a 0 or 1 depending on whether the user_id and the date share a row in the original DataFrame. In R I would use sapply and a user-defined function for this operation, but in Python I am struggling to find a solution.

With my array of user_ids denoted as

user_names= dat2['user_id'].unique()

My final DataFrame should be of the form

a8 a1 a4 a3 a15 a5
1  0  0  0  0  0
0  1  0  0  0  0
0  1  1  1  1  0
0  1  0  0  1  1
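
One possible way to get a table of this shape, offered as a sketch rather than the definitive approach, is to count the (date, user_id) pairs with pd.crosstab and then turn the counts into 0/1 flags. The names counts and indicator are made up for this illustration, and the stable sort is an assumption made so that user_names comes out in the order shown above.

```python
import pandas as pd

dat1 = pd.DataFrame({
    'user_id': ['a1', 'a1', 'a4', 'a3', 'a1', 'a15', 'a8', 'a15', 'a1', 'a5'],
    'Visits': [1, 4, 2, 1, 3, 1, 1, 8, 1, 9],
    'cell': [14, 21, 14, 14, 19, 10, 18, 17, 10, 11],
    'date': ['2011-01-05', '2011-01-05', '2011-01-12', '2011-01-12', '2011-01-12',
             '2011-01-12', '2011-01-02', '2011-01-19', '2011-01-19', '2011-01-19'],
})
dat1['date'] = pd.to_datetime(dat1['date'])

# Stable sort (assumption) so ties on date keep their original order.
dat2 = dat1.sort_values('date', kind='mergesort')

# Unique user ids in order of first appearance in the sorted frame.
user_names = dat2['user_id'].unique()

# Count how many rows each (date, user_id) pair has; a pair can occur
# more than once (a1 appears twice on 2011-01-05), so counts may exceed 1.
counts = pd.crosstab(dat2['date'], dat2['user_id'])

# Collapse counts to 0/1 and put the columns in the user_names order.
indicator = (counts > 0).astype(int).reindex(columns=user_names)

print(indicator)
```

The rows of indicator are labelled by date, so the date each row refers to stays visible; calling indicator.reset_index(drop=True) would drop those labels if only the bare 0/1 matrix is wanted.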

