Creating a Pandas DataFrame

Creating from a list or NumPy array

import pandas as pd
from datetime import datetime
import numpy as np


dates = [datetime(2011,1,2), datetime(2011,2,5), datetime(2011,3,5), datetime(2011,4,5),
        datetime(2011,5,7), datetime(2011,6,8), datetime(2011,7,5), datetime(2011,8,5),
        datetime(2011,9,10), datetime(2011,10,12), datetime(2011,11,5), datetime(2011,12,5)
         ]
df = pd.DataFrame(dates,columns=['date'])

# out
	date
0	2011-01-02
1	2011-02-05
2	2011-03-05
3	2011-04-05
4	2011-05-07
5	2011-06-08
6	2011-07-05
7	2011-08-05
8	2011-09-10
9	2011-10-12
10	2011-11-05
11	2011-12-05
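
The list example above covers only half of this section's title. As a minimal sketch of the NumPy side (the array contents and the column names `a`, `b` are made up for illustration), a 2-D array maps onto rows and columns directly, and a `datetime64` array produces a datetime column:

```python
import numpy as np
import pandas as pd

# A 3x2 integer array; each array column becomes a DataFrame column.
values = np.arange(6).reshape(3, 2)
df_np = pd.DataFrame(values, columns=['a', 'b'])

# A NumPy datetime64 array yields a datetime column, just like the list of
# datetime objects above.
days = np.array(['2011-01-02', '2011-02-05', '2011-03-05'], dtype='datetime64[D]')
df_dates = pd.DataFrame({'date': days})

print(df_np)
print(df_dates.dtypes)
```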

Reading and Writing Files with Pandas

# read a csv file
df = pd.read_csv('xxx.csv')
# write a csv file without the index column
df.to_csv('xxx.csv', index=False)

# pkl format
df.to_pickle('xxx.pkl')  # save as pickle
df = pd.read_pickle('xxx.pkl')  # read it back
 
# hdf format (requires the PyTables package)
df.to_hdf('xxx.hdf', key='df')  # save as hdf
df = pd.read_hdf('xxx.hdf', key='df')  # read it back

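Of the three formats, pkl reads fastest, reported at nearly 6 times the speed of reading the same data from csv, with hdf in second place. The exact ratio will vary with the data and the hardware.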

For everyday datasets (mostly csv), you can read the file with pandas once, save it as pkl or hdf, and load that copy from then on to save some time on every later read.
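
To check the speed difference on your own machine, something like the following rough sketch can be used. It assumes a synthetic 100,000-row frame and temporary file names (both hypothetical), and it leaves out hdf so that it runs without PyTables. The timings it prints are not meant to reproduce the 6x figure.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd

# Synthetic data standing in for a real dataset (hypothetical).
frame = pd.DataFrame(np.random.rand(100_000, 5), columns=list('abcde'))

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'demo.csv')
    pkl_path = os.path.join(tmp, 'demo.pkl')

    # Convert once: csv -> pkl.
    frame.to_csv(csv_path, index=False)
    frame.to_pickle(pkl_path)

    # Time a single read of each format.
    start = time.perf_counter()
    from_csv = pd.read_csv(csv_path)
    csv_seconds = time.perf_counter() - start

    start = time.perf_counter()
    from_pkl = pd.read_pickle(pkl_path)
    pkl_seconds = time.perf_counter() - start

print(f'read_csv: {csv_seconds:.4f}s  read_pickle: {pkl_seconds:.4f}s')
```

A single read is a noisy measurement; repeating each read several times and averaging gives a steadier comparison.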

Time Series Handling in Pandas

Creating

pd.date_range

# in df1
t1_range = pd.date_range('2020-01-09', periods=4, freq='1D20min')
df1 = pd.DataFrame(t1_range,columns=['date'])
# out df1
date
0	2020-01-09 00:00:00
1	2020-01-10 00:20:00
2	2020-01-11 00:40:00
3	2020-01-12 01:00:00

# in df2
t2_range = pd.date_range('2020-01-09', '2020-12-09', freq='1D20min')
df2 = pd.DataFrame(t2_range,columns=['date'])
# out df2
date
0	2020-01-09 00:00:00
1	2020-01-10 00:20:00
2	2020-01-11 00:40:00
3	2020-01-12 01:00:00
4	2020-01-13 01:20:00
...	...
326	2020-12-04 12:40:00
327	2020-12-05 13:00:00
328	2020-12-06 13:20:00
329	2020-12-07 13:40:00
330	2020-12-08 14:00:00

# in df3
t3_range = pd.date_range('2020-1-1', '2020-12-31')
df3 = pd.DataFrame(t3_range,columns=['date'])
# out df3
date
0	2020-01-01
1	2020-01-02
2	2020-01-03
3	2020-01-04
4	2020-01-05
...	...
361	2020-12-27
362	2020-12-28
363	2020-12-29
364	2020-12-30
365	2020-12-31
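
The three calls above differ only in how the range is bounded. A small sketch to confirm the spacing and the endpoint behaviour (the variable names are illustrative):

```python
import pandas as pd

# start + periods: exactly 4 timestamps, each 1 day 20 minutes apart.
by_periods = pd.date_range('2020-01-09', periods=4, freq='1D20min')

# start + end with the default daily frequency; both endpoints are included.
by_end = pd.date_range('2020-1-1', '2020-12-31')

step = by_periods[1] - by_periods[0]
print(step)         # spacing between consecutive timestamps
print(len(by_end))  # 2020 is a leap year, so this is 366
```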

Filtering

dt.between

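Use the dt accessor of a datetime column to extract its date or time part, then call between on the result. between is a Series method that takes a start and an end value and keeps the rows inside that range, both ends included.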

Filtering by date

# in
df2.date.dt.date
# out
0      2020-01-09
1      2020-01-10
2      2020-01-11
3      2020-01-12
4      2020-01-13
          ...    
326    2020-12-04
327    2020-12-05
328    2020-12-06
329    2020-12-07
330    2020-12-08
Name: date, Length: 331, dtype: object

# in
df2[df2['date'].dt.date.between(pd.Timestamp('2020-05-09').date(), pd.Timestamp('2020-05-12').date())]
# out
date
120	2020-05-09 16:00:00
121	2020-05-10 16:20:00
122	2020-05-11 16:40:00
123	2020-05-12 17:00:00

Filtering by time

# in
df2.date.dt.time
# out
0      00:00:00
1      00:20:00
2      00:40:00
3      01:00:00
4      01:20:00
         ...   
326    12:40:00
327    13:00:00
328    13:20:00
329    13:40:00
330    14:00:00
Name: date, Length: 331, dtype: object

# in 
df2[df2['date'].dt.time.between(pd.Timestamp( '12:00').time(),pd.Timestamp( '12:20').time())]
# out
date
36	2020-02-14 12:00:00
37	2020-02-15 12:20:00
108	2020-04-27 12:00:00
109	2020-04-28 12:20:00
180	2020-07-09 12:00:00
181	2020-07-10 12:20:00
252	2020-09-20 12:00:00
253	2020-09-21 12:20:00
324	2020-12-02 12:00:00
325	2020-12-03 12:20:00

Filtering by datetime

# in
df2.date.dt.datetime
# out
AttributeError                            Traceback (most recent call last)
<ipython-input-49-74cb5d03c1f4> in <module>
     16 
     17 
---> 18 df2.date.dt.datetime

AttributeError: 'DatetimeProperties' object has no attribute 'datetime'

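The dt accessor has no datetime attribute, since the column itself already holds the full datetime values. To filter on date and time together, either apply the date filter and the time filter one after the other, or compare the column directly as in the next section.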
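
One way to apply a date filter and a time filter together is to build the two boolean masks separately and combine them. A minimal sketch, rebuilding `df2` as above; the date bounds (April through September) are arbitrary and chosen only for illustration:

```python
from datetime import time

import pandas as pd

df2 = pd.DataFrame({'date': pd.date_range('2020-01-09', '2020-12-09', freq='1D20min')})

# Mask 1: timestamps from 2020-04-01 through the end of 2020-09-30.
in_dates = df2['date'].between(pd.Timestamp('2020-04-01'),
                               pd.Timestamp('2020-09-30 23:59:59'))

# Mask 2: time of day between 12:00 and 12:20, both ends included.
in_times = df2['date'].dt.time.between(time(12, 0), time(12, 20))

# Combine the masks element by element.
result = df2[in_dates & in_times]
print(result)
```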

Comparing the datetime column directly

# in
# Note: use & here; it cannot be replaced with `and`
df2[(df2['date'] >= pd.Timestamp('2020-05-09')) & (df2['date'] <= pd.Timestamp('2020-05-12'))]
# out
date
120	2020-05-09 16:00:00
121	2020-05-10 16:20:00
122	2020-05-11 16:40:00
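
The reason `&` is required: it combines two boolean Series element by element, whereas `and` asks Python for a single truth value of a whole Series, which pandas refuses with a ValueError. The sketch below rebuilds `df2` to show this, along with `Series.between` as a more compact spelling of the same filter (the variable names are illustrative):

```python
import pandas as pd

df2 = pd.DataFrame({'date': pd.date_range('2020-01-09', '2020-12-09', freq='1D20min')})
start, end = pd.Timestamp('2020-05-09'), pd.Timestamp('2020-05-12')

# & works element-wise on the two boolean Series.
mask = (df2['date'] >= start) & (df2['date'] <= end)

# `and` needs one True/False for the whole Series, so it raises.
try:
    (df2['date'] >= start) and (df2['date'] <= end)
    and_failed = False
except ValueError:
    and_failed = True

# Series.between is inclusive on both ends by default and gives the same mask.
same = df2['date'].between(start, end)
print(df2[mask])
```

Note that `end` is midnight on 2020-05-12, so the row at 2020-05-12 17:00:00 is excluded here, unlike in the date-based filter.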

Extracting date/time attributes

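Once the time column has been converted to datetime64, the .dt accessor gives quick access to its components. Two commonly used attributes are listed below:

df.date.dt.quarter   # quarter (1-4)
df.date.dt.month     # month (1-12)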

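What if you want to group a time series by season? Quarters and seasons generally do not line up: the first quarter covers January to March, while spring in China covers March to May. quarter therefore cannot be used, but filtering on the month works: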

df[(df.date.dt.month).isin([3,4,5])]

# out
date
2	2011-03-05
3	2011-04-05
4	2011-05-07
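
To label every row with its season instead of selecting one season at a time, the month can be mapped through a lookup table. This is a sketch under the assumption that seasons follow the March-to-May spring convention described above; the `season_of_month` dict and the `season` column name are hypothetical:

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame({'date': [datetime(2011, 1, 2), datetime(2011, 3, 5),
                            datetime(2011, 7, 5), datetime(2011, 10, 12),
                            datetime(2011, 12, 5)]})

# Hypothetical lookup table: month number -> season.
season_of_month = {
    12: 'winter', 1: 'winter', 2: 'winter',
    3: 'spring', 4: 'spring', 5: 'spring',
    6: 'summer', 7: 'summer', 8: 'summer',
    9: 'autumn', 10: 'autumn', 11: 'autumn',
}

# Map each row's month to its season label.
df['season'] = df['date'].dt.month.map(season_of_month)
print(df)
```

With the `season` column in place, `df.groupby('season')` can then aggregate by season.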

 

 

Article URL: https://blog.csdn.net/wq_ocean_/article/details/109924405