Creating a Pandas DataFrame
Creating from a list or NumPy array
import pandas as pd
from datetime import datetime
import numpy as np
dates = [datetime(2011,1,2), datetime(2011,2,5), datetime(2011,3,5), datetime(2011,4,5),
datetime(2011,5,7), datetime(2011,6,8), datetime(2011,7,5), datetime(2011,8,5),
datetime(2011,9,10), datetime(2011,10,12), datetime(2011,11,5), datetime(2011,12,5)
]
df = pd.DataFrame(dates,columns=['date'])
# out
date
0 2011-01-02
1 2011-02-05
2 2011-03-05
3 2011-04-05
4 2011-05-07
5 2011-06-08
6 2011-07-05
7 2011-08-05
8 2011-09-10
9 2011-10-12
10 2011-11-05
11 2011-12-05
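The example above only covers the list case. A minimal sketch of the NumPy case follows; the column names "a" and "b" and the sample values are made up for illustration and do not come from the original post.

```python
import numpy as np
import pandas as pd

# A 3x2 integer array; each array row becomes one DataFrame row.
values = np.arange(6).reshape(3, 2)

# "a" and "b" are hypothetical column labels chosen for this sketch.
df_np = pd.DataFrame(values, columns=["a", "b"])

# A NumPy datetime64 array also works and yields a datetime64 column.
np_dates = np.array(["2011-01-02", "2011-02-05", "2011-03-05"], dtype="datetime64[D]")
df_dates = pd.DataFrame({"date": np_dates})

print(df_np)
print(df_dates.dtypes)
```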
Reading and Writing Files with Pandas
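# Read a CSV file
df = pd.read_csv('xxx.csv')
# Write a CSV file without the index column
df.to_csv('xxx.csv', index=False)
# Pickle format
df.to_pickle('xxx.pkl')  # save as pickle
df = pd.read_pickle('xxx.pkl')  # read
# HDF format (requires the PyTables package)
df.to_hdf('xxx.hdf', key='df')  # save as HDF
df = pd.read_hdf('xxx.hdf', key='df')  # read from the same .hdf file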
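Pickle files are the fastest to read, nearly 6 times faster than reading the same data from CSV, with HDF coming second. The exact ratio depends on the data set.
Everyday data sets mostly arrive as CSV. You can read them in with pandas once, save them as pickle or HDF, and then save some time on every later read.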
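The "convert once, then read the faster format" workflow can be sketched as below. The helper name load_with_cache and the demo file are hypothetical, and the sketch assumes the CSV does not change after the pickle copy is written.

```python
import os
import tempfile

import pandas as pd


def load_with_cache(csv_path):
    """Read a CSV once, then reuse a pickle copy stored next to it.

    Hypothetical helper: it does not notice later changes to the CSV.
    """
    pkl_path = os.path.splitext(csv_path)[0] + ".pkl"
    if os.path.exists(pkl_path):
        return pd.read_pickle(pkl_path)
    df = pd.read_csv(csv_path)
    df.to_pickle(pkl_path)
    return df


# Demo on a throwaway CSV file in a temporary directory.
tmp_dir = tempfile.mkdtemp()
csv_path = os.path.join(tmp_dir, "demo.csv")
pd.DataFrame({"x": [1, 2, 3], "y": [0.5, 1.5, 2.5]}).to_csv(csv_path, index=False)

first = load_with_cache(csv_path)   # parses the CSV and writes demo.pkl
second = load_with_cache(csv_path)  # loads demo.pkl instead of the CSV
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.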
Time Series Handling in Pandas
Creation
pd.date_range
# in df1
t1_range = pd.date_range('2020-01-09', periods=4, freq='1D20min')
df1 = pd.DataFrame(t1_range,columns=['date'])
# out df1
date
0 2020-01-09 00:00:00
1 2020-01-10 00:20:00
2 2020-01-11 00:40:00
3 2020-01-12 01:00:00
# in df2
t2_range = pd.date_range('2020-01-09', '2020-12-09', freq='1D20min')
df2 = pd.DataFrame(t2_range,columns=['date'])
# out df2
date
0 2020-01-09 00:00:00
1 2020-01-10 00:20:00
2 2020-01-11 00:40:00
3 2020-01-12 01:00:00
4 2020-01-13 01:20:00
... ...
326 2020-12-04 12:40:00
327 2020-12-05 13:00:00
328 2020-12-06 13:20:00
329 2020-12-07 13:40:00
330 2020-12-08 14:00:00
# in df3
t3_range = pd.date_range('2020-1-1', '2020-12-31')
df3 = pd.DataFrame(t3_range,columns=['date'])
# out df3
date
0 2020-01-01
1 2020-01-02
2 2020-01-03
3 2020-01-04
4 2020-01-05
... ...
361 2020-12-27
362 2020-12-28
363 2020-12-29
364 2020-12-30
365 2020-12-31
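The freq string '1D20min' used above is a compound step of one day plus twenty minutes. A small check, reusing the same calls as df1 and df2, makes the step and the resulting lengths explicit:

```python
import pandas as pd

# Fixed number of points: start + periods + step.
t1_range = pd.date_range("2020-01-09", periods=4, freq="1D20min")
step = t1_range[1] - t1_range[0]

# Fixed start and end: points are generated until the end would be passed.
t2_range = pd.date_range("2020-01-09", "2020-12-09", freq="1D20min")

print(step)           # the distance between two consecutive timestamps
print(len(t2_range))  # number of generated timestamps
print(t2_range[-1])   # last timestamp that still fits before the end
```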
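Filtering
dt.between
Extract the date or time part with the dt accessor of a datetime Series, then call Series.between, which takes start and end arguments (both inclusive by default), to filter a specific range.
Filtering by date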
# in
df2.date.dt.date
# out
0 2020-01-09
1 2020-01-10
2 2020-01-11
3 2020-01-12
4 2020-01-13
...
326 2020-12-04
327 2020-12-05
328 2020-12-06
329 2020-12-07
330 2020-12-08
Name: date, Length: 331, dtype: object
# in
df2[df2['date'].dt.date.between(pd.Timestamp( "2020-05-09").date(),pd.Timestamp( '2020-05-12').date())]
# out
date
120 2020-05-09 16:00:00
121 2020-05-10 16:20:00
122 2020-05-11 16:40:00
123 2020-05-12 17:00:00
Filtering by time
# in
df2.date.dt.time
# out
0 00:00:00
1 00:20:00
2 00:40:00
3 01:00:00
4 01:20:00
...
326 12:40:00
327 13:00:00
328 13:20:00
329 13:40:00
330 14:00:00
Name: date, Length: 331, dtype: object
# in
df2[df2['date'].dt.time.between(pd.Timestamp( '12:00').time(),pd.Timestamp( '12:20').time())]
# out
date
36 2020-02-14 12:00:00
37 2020-02-15 12:20:00
108 2020-04-27 12:00:00
109 2020-04-28 12:20:00
180 2020-07-09 12:00:00
181 2020-07-10 12:20:00
252 2020-09-20 12:00:00
253 2020-09-21 12:20:00
324 2020-12-02 12:00:00
325 2020-12-03 12:20:00
Filtering by datetime
# in
df2.date.dt.datetime
# out
AttributeError Traceback (most recent call last)
<ipython-input-49-74cb5d03c1f4> in <module>
16
17
---> 18 df2.date.dt.datetime
AttributeError: 'DatetimeProperties' object has no attribute 'datetime'
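The dt accessor has no datetime attribute, so date and time cannot be filtered together through dt. To get that result with dt, filter by date and by time separately, one after the other.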
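One way to apply the two filters in turn is to build a date mask and a time mask and combine them with &. This is a sketch only; the date window (February through April 2020) is an arbitrary choice for illustration.

```python
from datetime import date, time

import pandas as pd

t2_range = pd.date_range("2020-01-09", "2020-12-09", freq="1D20min")
df2 = pd.DataFrame(t2_range, columns=["date"])

# Mask on the date part (hypothetical window chosen for this example).
date_mask = df2["date"].dt.date.between(date(2020, 2, 1), date(2020, 4, 30))

# Mask on the time-of-day part, same bounds as the time example above.
time_mask = df2["date"].dt.time.between(time(12, 0), time(12, 20))

# A row is kept only when both conditions hold.
result = df2[date_mask & time_mask]
print(result)
```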
Comparing the datetime column directly
# in
# Note: & cannot be replaced with and
df2[(df2['date'] >= pd.Timestamp( "2020-05-09")) & (df2['date'] <= pd.Timestamp( '2020-05-12'))]
# out
date
120 2020-05-09 16:00:00
121 2020-05-10 16:20:00
122 2020-05-11 16:40:00
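Note that pd.Timestamp('2020-05-12') means midnight, which is why the 2020-05-12 17:00:00 row is missing from the output above. The sketch below shows the same filter written with Series.between, plus one possible way to include the whole end day; the one-day offset is an assumption about what the reader may want, not something the original post does.

```python
import pandas as pd

t2_range = pd.date_range("2020-01-09", "2020-12-09", freq="1D20min")
df2 = pd.DataFrame(t2_range, columns=["date"])

start = pd.Timestamp("2020-05-09")
end = pd.Timestamp("2020-05-12")  # midnight at the start of May 12

# Equivalent to the >= / <= expression: both ends inclusive.
closed = df2[df2["date"].between(start, end)]

# Keep every row on May 12 as well: exclusive bound one day later.
whole_day = df2[(df2["date"] >= start) & (df2["date"] < end + pd.Timedelta(days=1))]

print(closed)
print(whole_day)
```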
Extracting time/date attributes
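Once the time column has been converted to datetime64, calling the .dt accessor is all it takes to get the value you need. Common attributes provided by .dt include:
df.date.dt.quarter  # quarter
df.date.dt.month  # month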
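A few more .dt attributes can be collected side by side, as in the sketch below. The three sample dates are taken from the first example; the name attrs is just a label for this illustration.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame(
    {"date": [datetime(2011, 1, 2), datetime(2011, 5, 7), datetime(2011, 10, 12)]}
)

# Each .dt attribute returns a Series aligned with the original rows.
attrs = pd.DataFrame(
    {
        "year": df["date"].dt.year,
        "quarter": df["date"].dt.quarter,
        "month": df["date"].dt.month,
        "day": df["date"].dt.day,
        "dayofweek": df["date"].dt.dayofweek,  # Monday is 0, Sunday is 6
    }
)
print(attrs)
```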
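How can a time series be grouped by season? Seasons and quarters are generally not the same thing: the four quarters start with January through March, while spring in China runs from March through May, which does not line up with any quarter. So quarter cannot be used here; select on the month instead: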
df[(df.date.dt.month).isin([3,4,5])]
# out
date
2 2011-03-05
3 2011-04-05
4 2011-05-07
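To label every row with a season instead of selecting just one, the month can be mapped through a dictionary. This is a sketch: only spring as March through May comes from the text above; the other three seasons follow the same three-month pattern as an assumption, and the sample dates are simplified.

```python
from datetime import datetime

import pandas as pd

# One date per month of 2011 (simplified sample data).
dates = [datetime(2011, m, 5) for m in range(1, 13)]
df = pd.DataFrame(dates, columns=["date"])

# Assumed convention: spring Mar-May, summer Jun-Aug,
# autumn Sep-Nov, winter Dec-Feb.
month_to_season = {
    3: "spring", 4: "spring", 5: "spring",
    6: "summer", 7: "summer", 8: "summer",
    9: "autumn", 10: "autumn", 11: "autumn",
    12: "winter", 1: "winter", 2: "winter",
}

df["season"] = df["date"].dt.month.map(month_to_season)

# Number of rows per season.
counts = df.groupby("season").size()
print(counts)
```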
Original article: https://blog.csdn.net/wq_ocean_/article/details/109924405