Data Cleaning: Outlier Handling

User submission 1394 2022-11-19 22:50:08

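Outlier handling

Outliers are values that fall outside the normal range; they are not erroneous values. They occur infrequently, but they can still bias the analysis in a real project. Outliers are usually identified with the box-plot method (interquartile range) or from the distribution (standard deviation). Detection can use a range of two standard deviations around the mean, or the distance between the upper and lower quartiles. Outliers are typically handled by capping or by discretizing the data.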

import pandas as pd
import numpy as np
import os

os.getcwd()

'D:\\Jupyter\\notebook\\Python数据清洗实战\\数据清洗之数据预处理'

os.chdir('D:\\Jupyter\\notebook\\Python数据清洗实战\\数据')

df = pd.read_csv('MotorcycleData.csv', encoding='gbk', na_values='Na')

def f(x):
    # Strip the dollar sign and thousands separators, then convert to float
    if '$' in str(x):
        x = str(x).strip('$')
        x = str(x).replace(',', '')
    else:
        x = str(x).replace(',', '')
    return float(x)

df['Price'] = df['Price'].apply(f)

df['Mileage'] = df['Mileage'].apply(f)
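The same cleanup can also be written with pandas string methods instead of a row-by-row `apply`. The following is a minimal sketch on a hypothetical sample Series (`raw` is made up for illustration and is not read from MotorcycleData.csv). Unlike `f`, which raises on text that cannot be parsed, `errors='coerce'` turns such entries into NaN.

```python
import pandas as pd

# Hypothetical raw values in the same style as the Price / Mileage columns
raw = pd.Series(['$11,412', '17,200', '$3,872', None])

# Remove '$' and ',' in one pass, then convert to numbers;
# missing or unparseable entries become NaN
cleaned = pd.to_numeric(
    raw.str.replace(r'[$,]', '', regex=True),
    errors='coerce',
)

print(cleaned.tolist())
```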

df.head(5)

| | Condition | Condition_Desc | Price | Location | Model_Year | Mileage | Exterior_Color | Make | Warranty | Model | ... | Vehicle_Title | OBO | Feedback_Perc | Watch_Count | N_Reviews | Seller_Status | Vehicle_Tile | Auction | Buy_Now | Bid_Count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Used | mint!!! very low miles | 11412.0 | McHenry, Illinois, United States | 2013.0 | 16000.0 | Black | Harley-Davidson | Unspecified | Touring | ... | NaN | FALSE | 8.1 | NaN | 2427 | Private Seller | Clear | True | FALSE | 28.0 |
| 1 | Used | Perfect condition | 17200.0 | Fort Recovery, Ohio, United States | 2016.0 | 60.0 | Black | Harley-Davidson | Vehicle has an existing warranty | Touring | ... | NaN | FALSE | 100 | 17 | 657 | Private Seller | Clear | True | TRUE | 0.0 |
| 2 | Used | NaN | 3872.0 | Chicago, Illinois, United States | 1970.0 | 25763.0 | Silver/Blue | BMW | Vehicle does NOT have an existing warranty | R-Series | ... | NaN | FALSE | 100 | NaN | 136 | NaN | Clear | True | FALSE | 26.0 |
| 3 | Used | CLEAN TITLE READY TO RIDE HOME | 6575.0 | Green Bay, Wisconsin, United States | 2009.0 | 33142.0 | Red | Harley-Davidson | NaN | Touring | ... | NaN | FALSE | 100 | NaN | 2920 | Dealer | Clear | True | FALSE | 11.0 |
| 4 | Used | NaN | 10000.0 | West Bend, Wisconsin, United States | 2012.0 | 17800.0 | Blue | Harley-Davidson | NO WARRANTY | Touring | ... | NaN | FALSE | 100 | 13 | 271 | OWNER | Clear | True | TRUE | 0.0 |

5 rows × 22 columns

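# Handle outliers in Price
# Compute the mean price
x_bar = df['Price'].mean()

# Compute the standard deviation of the price
x_std = df['Price'].std()

# Check for outliers above the upper bound
any(df['Price'] > x_bar + 2 * x_std)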

True

# Check for outliers below the lower bound
any(df['Price'] < x_bar - 2 * x_std)

False
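The `any(...)` checks only report whether at least one outlier exists. To see how many rows are flagged, and which ones, the boolean mask can be kept and reused. The sketch below assumes a hypothetical helper `std_outlier_mask` and made-up prices; neither comes from the original notebook.

```python
import pandas as pd

def std_outlier_mask(s: pd.Series, k: float = 2.0) -> pd.Series:
    """Return a boolean mask marking values outside mean +/- k * std."""
    mean, std = s.mean(), s.std()
    return (s < mean - k * std) | (s > mean + k * std)

# Hypothetical prices, not taken from MotorcycleData.csv
prices = pd.Series(
    [4000, 5200, 6100, 7500, 8000, 8300, 9000, 9900, 12000, 100000],
    dtype=float,
)
mask = std_outlier_mask(prices)

print(int(mask.sum()))        # number of flagged rows
print(prices[mask].tolist())  # the flagged values themselves
```

On the real data, the same mask could be applied as `df[std_outlier_mask(df['Price'])]` to inspect the flagged listings before deciding how to treat them.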

# Descriptive statistics
df['Price'].describe()

count      7493.000000
mean       9968.811557
std        8497.326850
min           0.000000
25%        4158.000000
50%        7995.000000
75%       13000.000000
max      100000.000000
Name: Price, dtype: float64

# 25th percentile
Q1 = df['Price'].quantile(q = 0.25)

# 75th percentile
Q3 = df['Price'].quantile(q = 0.75)

# Interquartile range
IQR = Q3 - Q1

any(df['Price'] > Q3 + 1.5 * IQR)

True

any(df['Price'] < Q1 - 1.5 * IQR)

False
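The two results can be checked by hand from the quartiles in the `describe()` output. This short sketch hard-codes those two published values (25% = 4158 and 75% = 13000) rather than reading the CSV; the names `lower` and `upper` are introduced here only for illustration.

```python
# Quartiles copied from the describe() output for Price
Q1, Q3 = 4158.0, 13000.0

IQR = Q3 - Q1            # interquartile range
lower = Q1 - 1.5 * IQR   # lower fence
upper = Q3 + 1.5 * IQR   # upper fence

print(IQR, lower, upper)  # 8842.0 -9105.0 26263.0
```

The maximum price (100000) lies above the upper fence of 26263, so the first check is True. The lower fence is negative while the minimum price is 0, so no value can fall below it and the second check is False.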

import matplotlib.pyplot as plt

%matplotlib inline

df['Price'].plot(kind='box')

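# Set the plot style (on Matplotlib 3.6 and later this style is named 'seaborn-v0_8')
plt.style.use('seaborn')
# Draw the histogram
df.Price.plot(kind='hist', bins=30, density=True)
# Draw the kernel density estimate
df.Price.plot(kind='kde')
# Show the figure
plt.show()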

# Replace outliers with the 99th and 1st percentiles
# Compute P1 and P99
P99 = df['Price'].quantile(q=0.99)
P1 = df['Price'].quantile(q=0.01)

P99

39995.32

df['Price_new'] = df['Price']

# Capping
df.loc[df['Price'] > P99, 'Price_new'] = P99
df.loc[df['Price'] < P1, 'Price_new'] = P1

df[['Price', 'Price_new']].describe()

| | Price | Price_new |
|---|---|---|
| count | 7493.000000 | 7493.000000 |
| mean | 9968.811557 | 9821.220873 |
| std | 8497.326850 | 7737.092537 |
| min | 0.000000 | 100.000000 |
| 25% | 4158.000000 | 4158.000000 |
| 50% | 7995.000000 | 7995.000000 |
| 75% | 13000.000000 | 13000.000000 |
| max | 100000.000000 | 39995.320000 |
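Capping at the 1st and 99th percentiles can also be expressed with `Series.clip`, which applies both bounds in one call. This is an alternative sketch on made-up prices, not the notebook's own code; `p1`, `p99`, and `capped` are illustrative names.

```python
import pandas as pd

# Hypothetical prices, not taken from MotorcycleData.csv
prices = pd.Series([0, 150, 4000, 8000, 13000, 40000, 100000], dtype=float)

p1 = prices.quantile(0.01)
p99 = prices.quantile(0.99)

# clip() raises values below p1 to p1 and lowers values above p99 to p99;
# values already inside [p1, p99] are left unchanged
capped = prices.clip(lower=p1, upper=p99)

print(capped.tolist())
```

As with the `df.loc` assignments, only the tails change, which is why the quartiles of `Price` and `Price_new` in the table are identical while the minimum and maximum differ.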

# df['Price_new'].plot(kind='box')
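The overview also names discretization as a way to handle outliers, but the walkthrough only demonstrates capping. A minimal sketch of binning with `pd.qcut` and `pd.cut` follows; the prices, bin edges, and labels are all hypothetical and chosen only to illustrate the idea.

```python
import pandas as pd

# Hypothetical prices, not taken from MotorcycleData.csv
prices = pd.Series([0, 150, 4000, 8000, 13000, 40000, 100000], dtype=float)

# Equal-frequency bins: edges are the quartiles of the data
by_quantile = pd.qcut(
    prices, q=4, labels=['low', 'mid-low', 'mid-high', 'high']
)

# Fixed-edge bins: edges chosen by hand (hypothetical thresholds);
# open-ended outer bins absorb extreme values on either side
by_edges = pd.cut(
    prices,
    bins=[-float('inf'), 5000, 15000, float('inf')],
    labels=['budget', 'mid-range', 'premium'],
)

print(by_quantile.tolist())
print(by_edges.tolist())
```

Once values are replaced by bin labels, an extreme price such as 100000 simply lands in the top bin, so it no longer dominates statistics computed on the binned column.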
