Creating data

Random data

Create a Series; pandas generates a default integer index for it.

import numpy as np
import pandas as pd

s = pd.Series([1,3,5,np.nan,6,8])

Create a DataFrame from a NumPy array, with a date index and labeled columns.

dates = pd.date_range('20161010', periods=6)
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list('ABCD'))

df
Out[4]: 
                   A         B         C         D
2016-10-10  0.630275  1.081899 -1.594402 -2.571683
2016-10-11 -0.211379 -0.166089 -0.480015 -0.346706
2016-10-12 -0.416171 -0.640860  0.944614 -0.756651
2016-10-13  0.652248  0.186364  0.943509  0.053282
2016-10-14 -0.430867 -0.494919 -0.280717 -1.327491
2016-10-15  0.306519 -2.103769 -0.019832  0.035211

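Here np.random.randn returns an array of random samples drawn from the standard normal distribution; the arguments give its shape (6 rows by 4 columns).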
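
Because the values above change on every run, it can help to seed a generator while experimenting. The following is a minimal sketch, not part of the original notes: the seed value 0 and the use of np.random.default_rng are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Assumption: a fixed seed (0) so the "random" frame is identical on every run.
rng = np.random.default_rng(0)

dates = pd.date_range('20161010', periods=6)   # six consecutive days
df = pd.DataFrame(rng.standard_normal((6, 4)), index=dates, columns=list('ABCD'))

print(df.shape)      # (6, 4)
print(df.index[0])   # 2016-10-10 00:00:00
```

The index and column labels are the same as in the example above; only the source of the random numbers differs.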

Creating from a dict

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })

Out[20]: 
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo

Creating a Timestamp

There are several ways to construct a Timestamp object.

pd.Timestamp

import pandas as pd
from datetime import datetime as dt
p1=pd.Timestamp(2017,6,19)
p2=pd.Timestamp(dt(2017,6,19,hour=9,minute=13,second=45))
p3=pd.Timestamp("2017-6-19 9:13:45")

print("type of p1:",type(p1))
print(p1)
print("type of p2:",type(p2))
print(p2)
print("type of p3:",type(p3))
print(p3)


type of p1: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-06-19 00:00:00
type of p2: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-06-19 09:13:45
type of p3: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-06-19 09:13:45

to_datetime()

import pandas as pd
from datetime import datetime as dt

p4=pd.to_datetime("2017-6-19 9:13:45")
p5=pd.to_datetime(dt(2017,6,19,hour=9,minute=13,second=45))

print("type of p4:",type(p4))
print(p4)
print("type of p5:",type(p5))
print(p5)

type of p4: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-06-19 09:13:45
type of p5: <class 'pandas._libs.tslibs.timestamps.Timestamp'>
2017-06-19 09:13:45
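
The two routes above produce the same kind of object for a single value. As a small sketch (the variable names single and many are only illustrative), to_datetime also accepts a list, in which case it returns a DatetimeIndex rather than a single Timestamp:

```python
import pandas as pd

# A single string gives a Timestamp; a list of strings gives a DatetimeIndex.
single = pd.to_datetime("2017-06-19 09:13:45")
many = pd.to_datetime(["2017-06-19", "2017-06-20"])

print(type(single).__name__)   # Timestamp
print(type(many).__name__)     # DatetimeIndex

# Both construction routes describe the same instant.
print(single == pd.Timestamp(2017, 6, 19, 9, 13, 45))   # True
```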

Reading data

Reading a CSV file

df = pd.read_csv('x.csv')

Reading a compressed archive

import zipfile

with zipfile.ZipFile('x.csv.zip', 'r') as z:
    f = z.open('x.csv')
    df = pd.read_csv(f, header=0)
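
As an alternative to opening the archive by hand, read_csv can usually decompress it itself. The sketch below builds a throwaway archive in a temporary directory so that it runs on its own; the file contents and the name df_zip are made up for illustration.

```python
import os
import tempfile
import zipfile

import pandas as pd

# Build a small throwaway archive so the example is self-contained.
tmpdir = tempfile.mkdtemp()
zip_path = os.path.join(tmpdir, 'x.csv.zip')
with zipfile.ZipFile(zip_path, 'w') as z:
    z.writestr('x.csv', 'a,b\n1,2\n3,4\n')

# compression='zip' works when the archive holds exactly one file.
df_zip = pd.read_csv(zip_path, compression='zip')
print(df_zip)
```

When the archive contains several files, the zipfile approach above is still needed to pick one of them.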

Viewing data

See the Basics section.

Viewing data types

df2.dtypes

Out[30]: 
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

Viewing the head and tail

df.head(1)
df.tail(3)

Viewing the index, columns, and data

df.index
df.columns
df.values

Showing a quick statistical summary of the data

df.describe()
Out[19]: 
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.073711 -0.431125 -0.687758 -0.233103
std    0.843157  0.922818  0.779887  0.973118
min   -0.861849 -2.104569 -1.509059 -1.135632
25%   -0.611510 -0.600794 -1.368714 -1.076610
50%    0.022070 -0.228039 -0.767252 -0.386188
75%    0.658444  0.041933 -0.034326  0.461706
max    1.212112  0.567020  0.276232  1.071804

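Filtering and selecting data

Transpose

df.T

Iteration

traj_plot.py

# Excerpt: locationF is an already-open output file and cnt a counter, both defined elsewhere in the script
df = df.set_index('gpstime')
for index, row in df.iterrows():
    locationF.write("p%s | %s | %s | %s | %s " % (str(cnt), index, str(row.iloc[0]), str(row.iloc[1]), str(row.iloc[2])) + '\n' )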

import numpy as np
import pandas as pd

def _map(data, exp):
    for index, row in data.iterrows():   # get the index and row of each record
        for col_name in data.columns:
            # iterrows() yields copies of the rows, so write the result back into data explicitly
            data.at[index, col_name] = exp(row[col_name])
    return data

def _1map(data, exp):
    _data = [[exp(row[col_name])               # collect the results in a nested (2-level) list
             for col_name in data.columns]
             for index, row in data.iterrows()
            ]
    return _data


if __name__ == "__main__":
    inp = [{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}]
    df = pd.DataFrame(inp)
    temp = _map(df, lambda ele: ele + 1)   # modifies df in place and returns it
    print(temp)

    _temp = _1map(df, lambda ele: ele + 1)
    res_data = pd.DataFrame(_temp)         # convert the nested list back to a DataFrame
    print(res_data)
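
Row-by-row loops like the two helpers above are rarely needed for element-wise work. As a hedged sketch of the usual alternative (the names plus_one and also_plus_one are illustrative only), the same "add one to every cell" operation can be written without iterrows:

```python
import pandas as pd

df = pd.DataFrame([{'c1': 10, 'c2': 100}, {'c1': 11, 'c2': 110}, {'c1': 12, 'c2': 120}])

# Arithmetic is applied to every cell at once; df itself is left unchanged.
plus_one = df + 1

# For an arbitrary Python function, map it over each column instead.
also_plus_one = df.apply(lambda col: col.map(lambda ele: ele + 1))

print(plus_one['c1'].tolist())        # [11, 12, 13]
print(also_plus_one['c2'].tolist())   # [101, 111, 121]
```

Unlike the nested-list version, both results keep the original column names and index.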

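Sorting

Sorting by column name

# axis=0 refers to the rows (index), axis=1 to the columns
df.sort_index(axis=1, ascending=False)

Sorting by the values of a column

df.sort_values(by='B')

import pandas as pd

df = pd.read_csv('./query_result.csv', sep=',')
# Convert to datetime
df['gpstime'] = pd.to_datetime(df['gpstime'])
# Sort by a column; sort_values returns a new DataFrame, so assign the result
df = df.sort_values(['gpstime'])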
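
A point that is easy to miss is that sort_values does not reorder the frame it is called on. A minimal sketch with made-up gpstime and speed values:

```python
import pandas as pd

df = pd.DataFrame({
    'gpstime': ['2018-04-22 03:00:00', '2018-04-22 01:00:00', '2018-04-22 02:00:00'],
    'speed': [3.0, 1.0, 2.0],
})
df['gpstime'] = pd.to_datetime(df['gpstime'])

by_time = df.sort_values('gpstime')                         # ascending, new DataFrame
newest_first = df.sort_values('gpstime', ascending=False)   # descending

print(by_time['speed'].tolist())        # [1.0, 2.0, 3.0]
print(newest_first['speed'].tolist())   # [3.0, 2.0, 1.0]
print(df['speed'].tolist())             # [3.0, 1.0, 2.0], original order untouched
```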

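Selection

Selecting a single column

df['A']

Selecting several rows

df[0:3]
# Rows can also be sliced by index label; a single row label on its own is not allowed with df[...] (use df.loc[label] for that)
df['20161011':'20161013']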
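
The difference between positional and label-based row selection is easier to see with explicit accessors. The sketch below assumes a small frame with a six-day date index like the one created earlier, filled with consecutive integers so the results are predictable:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20161010', periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list('ABCD'))

first_three = df.iloc[0:3]                     # by position, end excluded
by_label = df.loc['20161011':'20161013']       # by label, both ends included
one_row = df.loc[pd.Timestamp('2016-10-12')]   # a single row comes back as a Series

print(len(first_three), len(by_label))   # 3 3
print(one_row['A'])                      # 8
```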

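Selecting several columns and converting them to an array

# as_matrix() was removed in pandas 1.0; select the columns and call to_numpy() instead
coords = dftest[['longitude', 'latitude']].to_numpy()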

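Filtering

How do you remove or filter out certain values or rows of a dataset in pandas?

Deleting a column

Method 1: use del directly: del DF['column-name']

Method 2: use the drop method, which can be written in three ways:
1. DF = DF.drop('column_name', axis=1)
2. DF.drop('column_name', axis=1, inplace=True) # inplace=True modifies the original DF; otherwise the result is returned as a new DF
3. DF.drop(DF.columns[ : ], axis=1, inplace=True) # Note: zero indexed; the slice selects the columns to drop by position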
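
A short sketch of dropping columns by name and by position follows. The four-column frame and the result names are invented for the example, and the columns= keyword is used here as an alternative spelling of axis=1:

```python
import pandas as pd

DF = pd.DataFrame({'a': [1, 2], 'b': [3, 4], 'c': [5, 6], 'd': [7, 8]})

no_a = DF.drop(columns='a')                          # by name, returns a new frame
no_first_two = DF.drop(columns=DF.columns[[0, 1]])   # by position (zero-indexed)

print(list(no_a.columns))           # ['b', 'c', 'd']
print(list(no_first_two.columns))   # ['c', 'd']
print(list(DF.columns))             # ['a', 'b', 'c', 'd']
```

Without inplace=True the original DF keeps all of its columns, as the last line shows.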

See also: "Deleting columns in pandas"

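Filtering by time range

df = df.set_index('gpstime')
df['2018-04-22 01:00:00': '2018-04-22 05:00:00']

Filtering a column by a condition

# Keep the rows where speed is below 1 (a boolean mask; works in both Python 2 and Python 3)
nightdf = nightdf[nightdf['speed']<1]
# Negate a condition with ~ : keep the rows where the sixth-from-last column is not greater than 0
df06 = df04.loc[~(df04.iloc[:, -6].astype(float) > 0.0)]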
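
Condition-based filtering relies on boolean masks. Here is a minimal sketch with an invented nightdf (the hour column is an assumption added to show how conditions combine):

```python
import pandas as pd

nightdf = pd.DataFrame({'speed': [0.2, 3.5, 0.9, 12.0], 'hour': [1, 2, 3, 4]})

slow = nightdf[nightdf['speed'] < 1]          # rows where the mask is True
not_slow = nightdf[~(nightdf['speed'] < 1)]   # ~ negates the mask
# Combine conditions with & (and) or | (or); each condition needs its own parentheses.
slow_late = nightdf[(nightdf['speed'] < 1) & (nightdf['hour'] >= 3)]

print(slow.index.tolist())        # [0, 2]
print(not_slow.index.tolist())    # [1, 3]
print(slow_late.index.tolist())   # [2]
```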

groupby

See also: "Grouping and visualizing data with pandas"

See also: "pandas aggregation and group operations: the GroupBy technique (1)"

Example 1

from sklearn.datasets import make_blobs
from matplotlib import pyplot
from pandas import DataFrame
# generate 2d classification dataset
X, y = make_blobs(n_samples=100, centers=3, n_features=2)
# The dict defines three keys (the two coordinates and the label); the DataFrame is built from it
df = DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue', 2:'green'}
fig, ax = pyplot.subplots()
# groupby splits the data into groups according to the key passed in
grouped = df.groupby('label')
for key, group in grouped:
   group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
pyplot.show()

Example 2

import pandas as pd
import matplotlib.pyplot as plt

# Assign a value to a custom interval of width lim
def cla(n,lim):
    return '[%.f,%.f)'%(lim*(n//lim),lim*(n//lim)+lim) # map function

# By default the first line is the header and the data starts on the second line; sep is the delimiter
df = pd.read_csv('/home/david/iaudience-plan-statistics.csv', sep=',')
# Set the data type of a column
df['precent'] = df['precent'].astype('float64')
# Group by planid, then sum precent within each group
grouped = df['precent'].groupby(df['planid']).sum()

c = pd.DataFrame(grouped)
# Either c.precent or c['precent'] works.
# Assign a plain list: a new Series would get a 0..n-1 index that does not align with the planid index of c
c['addone'] = [cla(s,1) for s in c.precent]
groups3 = c.groupby(['addone']).count()
groups3['precent'].plot(kind='bar')
plt.show()
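
Since Example 2 depends on a local CSV file, here is a small stand-in that can be run anywhere. The planid and precent values are invented; the column names and the cla helper follow the example above, and value_counts is used in place of a second groupby purely for brevity.

```python
import pandas as pd

# Hypothetical data standing in for the CSV file used above.
df = pd.DataFrame({'planid': [1, 1, 2, 3, 3],
                   'precent': [0.4, 0.3, 1.5, 0.2, 0.6]})

def cla(n, lim):
    # Label the half-open interval of width lim that contains n.
    return '[%.f,%.f)' % (lim * (n // lim), lim * (n // lim) + lim)

total = df.groupby('planid')['precent'].sum()   # one summed value per planid
bins = total.map(lambda s: cla(s, 1))           # interval label per planid
counts = bins.value_counts().sort_index()       # number of plans per interval

print(bins.tolist())      # ['[0,1)', '[1,2)', '[0,1)']
print(counts.to_dict())   # {'[0,1)': 2, '[1,2)': 1}
```

Because bins is derived from total with map, it keeps the planid index, which avoids the alignment problem that arises when a separately built Series is assigned to the grouped frame.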

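Removing duplicates

https://blog.csdn.net/xinxing__8185/article/details/48022401

from pandas import Series, DataFrame

data = DataFrame({'k': [1, 1, 2, 2]})

print(data)

IsDuplicated = data.duplicated()

print(IsDuplicated)
print(type(IsDuplicated))

data = data.drop_duplicates()
print(data)

The DataFrame method duplicated returns a boolean Series indicating whether each row repeats an earlier row.

The drop_duplicates method returns a DataFrame with the duplicate rows removed.

By default both methods compare all columns; you can also specify a subset of columns to use when detecting duplicates.

For example, to remove duplicates based on a column named k2 (assuming the DataFrame has such a column):

data.drop_duplicates(['k2'])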
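
To make the subset behaviour concrete, here is a minimal sketch with a two-column frame. The columns k1 and k2 and their values are invented for illustration, and keep='last' is shown as one of the available options:

```python
import pandas as pd

data = pd.DataFrame({'k1': ['a', 'a', 'b', 'b'], 'k2': [1, 1, 2, 3]})

print(data.duplicated().tolist())          # [False, True, False, False]
print(len(data.drop_duplicates()))         # 3 (only the fully identical row is dropped)
print(len(data.drop_duplicates(['k1'])))   # 2 (one row per distinct k1)

# keep='last' retains the last occurrence of each k1 instead of the first.
print(data.drop_duplicates(['k1'], keep='last')['k2'].tolist())   # [1, 3]
```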

Applications

Clustering with k-means

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Read the text data into a DataFrame and convert it to a NumPy array stored in dataSet
df = pd.read_table('d:/22.txt')
dataSet = df.to_numpy()
# n_clusters=4 sets the number of clusters; here the data is split into 4 classes
kmeans = KMeans(n_clusters=4, random_state=0).fit(dataSet)
# center holds the cluster centers; store them in the DataFrame df_center with labeled columns
center = kmeans.cluster_centers_
df_center = pd.DataFrame(center, columns=['x', 'y'])
# Cluster label assigned to each point
labels = kmeans.labels_
# Use the cluster label as the index of the original data, then extract each cluster by index
df = pd.DataFrame(dataSet, index=labels, columns=['x', 'y'])
df1 = df[df.index==0]
df2 = df[df.index==1]
df3 = df[df.index==2]
df4 = df[df.index==3]
# Plot
plt.figure(figsize=(10,8), dpi=80)
axes = plt.subplot()
# s is the marker size, c the color, marker the marker style; for how DataFrame columns are referenced, see the blog's other posts
type1 = axes.scatter(df1.loc[:,['x']], df1.loc[:,['y']], s=50, c='red', marker='d')
type2 = axes.scatter(df2.loc[:,['x']], df2.loc[:,['y']], s=50, c='green', marker='*')
type3 = axes.scatter(df3.loc[:,['x']], df3.loc[:,['y']], s=50, c='brown', marker='p')
type4 = axes.scatter(df4.loc[:,['x']], df4.loc[:,['y']], s=50, c='black')
# Show the cluster centers
type_center = axes.scatter(df_center.loc[:,'x'], df_center.loc[:,'y'], s=40, c='blue')
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)
axes.legend((type1, type2, type3, type4, type_center), ('0','1','2','3','center'), loc=1)
plt.show()

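Troubleshooting notes

Pandas plots do not display in PyCharm

I recently started using PyCharm and like it, except that the plot() method of a pandas Series or DataFrame does not show the figure: the script simply finishes. The same code does draw the plot in IPython.

The code originally looked like this:

import matplotlib.pyplot as plt
from pandas import DataFrame,Series

Series([4,5,7]).plot()

It turns out that adding a call to

plt.show() after plot() is enough to display the figure.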

Published by: 全栈程序员-用户IM. When reposting, please credit the source: https://javaforall.cn/174611.html Original link: https://javaforall.cn
