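Kaggle in Practice: House Price Prediction, Advanced Regression Techniques (Part 1)

Introduction:

For beginners, working through a competition is a good way to consolidate the basics and to spot gaps in one's knowledge early. This article covers the centerpiece of the competition: data preprocessing. Data quality often determines the upper bound on the accuracy a machine learning model can reach, so this part is especially important.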

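After entering the competition page you will see the interface shown in the figure. The task is to predict house prices, which makes it a regression problem.

It is recommended to run all of the code in this article in Jupyter.

1. Don't panic when you get the data: take an overview first and look at how the data is distributed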

# Import the required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Read the training data with pandas
df = pd.read_csv('train.csv')
# Look at the first five rows of the DataFrame to see which variables it contains
df.head(5)

The output is shown below (if needed, the meaning of each feature can be looked up on the Kaggle website):

Id	MSSubClass	MSZoning	LotFrontage	LotArea	Street	Alley	LotShape	LandContour	Utilities	LotConfig	LandSlope	Neighborhood	Condition1	Condition2	BldgType	HouseStyle	OverallQual	OverallCond	YearBuilt	YearRemodAdd	RoofStyle	RoofMatl	Exterior1st	Exterior2nd	MasVnrType	MasVnrArea	ExterQual	ExterCond	Foundation	BsmtQual	BsmtCond	BsmtExposure	BsmtFinType1	BsmtFinSF1	BsmtFinType2	BsmtUnfSF	TotalBsmtSF	Heating	HeatingQC	CentralAir	Electrical	1stFlrSF	2ndFlrSF	GrLivArea	BsmtFullBath	BsmtHalfBath	FullBath	HalfBath	BedroomAbvGr	KitchenAbvGr	KitchenQual	TotRmsAbvGrd	Functional	Fireplaces	FireplaceQu	GarageType	GarageYrBlt	GarageFinish	GarageCars	GarageArea	GarageQual	GarageCond	PavedDrive	WoodDeckSF	OpenPorchSF	EnclosedPorch	PoolQC	Fence	MiscFeature	MoSold	YrSold	SaleType	SaleCondition	SalePrice
1	60	RL	65	8450	Pave	NA	Reg	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2003	2003	Gable	CompShg	VinylSd	VinylSd	BrkFace	196	Gd	TA	PConc	Gd	TA	No	GLQ	706	Unf	150	856	GasA	Ex	Y	SBrkr	856	854	1710	1	0	2	1	3	1	Gd	8	Typ	0	NA	Attchd	2003	RFn	2	548	TA	TA	Y	0	61	0	NA	NA	NA	2	2008	WD	Normal	208500
2	20	RL	80	9600	Pave	NA	Reg	Lvl	AllPub	FR2	Gtl	Veenker	Feedr	Norm	1Fam	1Story	6	8	1976	1976	Gable	CompShg	MetalSd	MetalSd	None	0	TA	TA	CBlock	Gd	TA	Gd	ALQ	978	Unf	284	1262	GasA	Ex	Y	SBrkr	1262	0	1262	0	1	2	0	3	1	TA	6	Typ	1	TA	Attchd	1976	RFn	2	460	TA	TA	Y	298	0	0	NA	NA	NA	5	2007	WD	Normal	181500
3	60	RL	68	11250	Pave	NA	IR1	Lvl	AllPub	Inside	Gtl	CollgCr	Norm	Norm	1Fam	2Story	7	5	2001	2002	Gable	CompShg	VinylSd	VinylSd	BrkFace	162	Gd	TA	PConc	Gd	TA	Mn	GLQ	486	Unf	434	920	GasA	Ex	Y	SBrkr	920	866	1786	1	0	2	1	3	1	Gd	6	Typ	1	TA	Attchd	2001	RFn	2	608	TA	TA	Y	0	42	0	NA	NA	NA	9	2008	WD	Normal	223500
4	70	RL	60	9550	Pave	NA	IR1	Lvl	AllPub	Corner	Gtl	Crawfor	Norm	Norm	1Fam	2Story	7	5	1915	1970	Gable	CompShg	Wd Sdng	Wd Shng	None	0	TA	TA	BrkTil	TA	Gd	No	ALQ	216	Unf	540	756	GasA	Gd	Y	SBrkr	961	756	1717	1	0	1	0	3	1	Gd	7	Typ	1	Gd	Detchd	1998	Unf	3	642	TA	TA	Y	0	35	272	NA	NA	NA	2	2006	WD	Abnorml	140000
5	60	RL	84	14260	Pave	NA	IR1	Lvl	AllPub	FR2	Gtl	NoRidge	Norm	Norm	1Fam	2Story	8	5	2000	2000	Gable	CompShg	VinylSd	VinylSd	BrkFace	350	Gd	TA	PConc	Gd	TA	Av	GLQ	655	Unf	490	1145	GasA	Ex	Y	SBrkr	1145	1053	2198	1	0	2	1	4	1	Gd	9	Typ	1	TA	Attchd	2000	RFn	3	836	TA	TA	Y	192	84	0	NA	NA	NA	12	2008	WD	Normal	250000

# For a categorical text feature such as MSZoning, use pandas' counting function to inspect it
df['MSZoning'].value_counts()
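As a minimal sketch of this kind of first look, the snippet below counts categories with value_counts() and summarizes a numeric column with describe(). The frame `toy` is a hypothetical stand-in for df: its MSZoning values are made up for illustration, while the five SalePrice values are the ones from the head(5) output above.

```python
import pandas as pd

# Hypothetical stand-in for df; MSZoning values are made up,
# SalePrice values are the five rows shown by df.head(5).
toy = pd.DataFrame({
    "MSZoning": ["RL", "RL", "RM", "RL", "FV"],
    "SalePrice": [208500, 181500, 223500, 140000, 250000],
})

# Categorical feature: absolute counts and relative shares per category.
counts = toy["MSZoning"].value_counts()
shares = toy["MSZoning"].value_counts(normalize=True)

# Numeric feature: count, mean, std, min, quartiles and max in one call.
price_stats = toy["SalePrice"].describe()

print(counts)
print(shares)
print(price_stats)
```

On the real data, the same two calls can be applied to df directly; value_counts() suits object columns, while describe() gives a quick sense of the spread of numeric ones.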

2. Handling missing values

2.1 First, use seaborn's heatmap to get an overall view of the missing values:

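# Visualize the result of pandas' df.isnull() to see the missing values at a glance
# yticklabels=False hides the row (y-axis) labels; cbar=False hides the color bar
sns.heatmap(df.isnull(),yticklabels=False,cbar=False)

The visualization is shown below; the more white there is, the more missing values the data has: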

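2.2 The visualization only gives an intuitive feel; we still need to look at the data in detail

# Check the dimensions of the data, to compare against the non-null counts
print(df.shape)
# Show the summary information of the data
df.info()

The output is as follows:

Only part of the output is shown here. The training data has 1460 rows. LotFrontage has 1201 non-null values, so it is partially missing; Alley has only 91 non-null values, so it is severely missing. With this information we can start processing: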
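Reading df.info() line by line works, but the same comparison can also be tabulated. The following is a sketch only: `missing_summary` is a hypothetical helper name, and `toy` is a small made-up frame standing in for df.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for the training data.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
    "Alley": [np.nan, np.nan, np.nan, "Grvl", np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

def missing_summary(frame):
    """Return the count and share of missing values per column, worst first."""
    n_missing = frame.isnull().sum()
    summary = pd.DataFrame({
        "n_missing": n_missing,
        "share": n_missing / len(frame),
    })
    # Keep only the columns that actually have gaps.
    summary = summary[summary["n_missing"] > 0]
    return summary.sort_values("share", ascending=False)

summary = missing_summary(toy)
print(summary)
```

Calling missing_summary(df) on the real training data should list the severely missing columns first, which makes it easier to decide which ones to drop and which to fill.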

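2.3 Processing

1. For severely missing columns there is no point in filling them in; simply drop them

2. For partially missing columns, fill numeric ones with the mean and object-type ones with the mode

Since we are just getting started, we fill the values in the conventional way first and adjust later based on the prediction results

Based on the information from df.info(), the missing values are handled as follows: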
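The two rules above can also be sketched as one generic helper. This is an illustration rather than the code used in this article: `simple_impute` and its `drop_threshold` cutoff of 0.5 are assumptions introduced here, and `toy` is a made-up frame.

```python
import numpy as np
import pandas as pd

def simple_impute(frame, drop_threshold=0.5):
    """Drop columns missing more than drop_threshold, then fill the rest.

    drop_threshold is an assumed cutoff, not a value taken from the article.
    """
    out = frame.copy()
    share_missing = out.isnull().mean()
    # Rule 1: drop columns that are mostly empty.
    out = out.drop(columns=share_missing[share_missing > drop_threshold].index)
    for col in out.columns:
        if not out[col].isnull().any():
            continue
        if pd.api.types.is_numeric_dtype(out[col]):
            # Rule 2, numeric columns: fill with the column mean.
            out[col] = out[col].fillna(out[col].mean())
        else:
            # Rule 2, non-numeric columns: fill with the most frequent value.
            out[col] = out[col].fillna(out[col].mode()[0])
    return out

# Made-up data: one numeric gap, one categorical gap, one mostly empty column.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0, 70.0],
    "BsmtQual": ["Gd", "TA", None, "Gd"],
    "Alley": [None, None, None, "Grvl"],
})
cleaned = simple_impute(toy)
print(cleaned)
```

A fixed threshold is a blunt rule; handling columns one by one leaves room to treat individual columns differently.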

# Drop the severely missing columns (axis=1 drops by column)
df.drop(['Alley'],axis=1,inplace=True)
df.drop(['GarageYrBlt'],axis=1,inplace=True)
df.drop(['PoolQC','Fence','MiscFeature'],axis=1,inplace=True)
# Id can be dropped as well; it is of no use for prediction
df.drop(['Id'],axis=1,inplace=True)

# Fill partially missing numeric columns with the mean
df['LotFrontage']=df['LotFrontage'].fillna(df['LotFrontage'].mean())

# Fill partially missing object-type columns with the most frequent value (the mode)
df['BsmtCond']=df['BsmtCond'].fillna(df['BsmtCond'].mode()[0])
df['BsmtQual']=df['BsmtQual'].fillna(df['BsmtQual'].mode()[0])
df['BsmtFinType2']=df['BsmtFinType2'].fillna(df['BsmtFinType2'].mode()[0])
df['BsmtExposure']=df['BsmtExposure'].fillna(df['BsmtExposure'].mode()[0])
df['BsmtFinType1']=df['BsmtFinType1'].fillna(df['BsmtFinType1'].mode()[0])
df['FireplaceQu']=df['FireplaceQu'].fillna(df['FireplaceQu'].mode()[0])
df['GarageType']=df['GarageType'].fillna(df['GarageType'].mode()[0])
df['GarageFinish']=df['GarageFinish'].fillna(df['GarageFinish'].mode()[0])
df['GarageQual']=df['GarageQual'].fillna(df['GarageQual'].mode()[0])
df['GarageCond']=df['GarageCond'].fillna(df['GarageCond'].mode()[0])
df['MasVnrType']=df['MasVnrType'].fillna(df['MasVnrType'].mode()[0])
# MasVnrArea is numeric, but it is filled with its mode here as well
df['MasVnrArea']=df['MasVnrArea'].fillna(df['MasVnrArea'].mode()[0])