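Click through to read another author's summary of DecisionTreeClassifier, the decision tree classifier.
DecisionTreeClassifier is a classification tree.
The other kind is the regression tree, DecisionTreeRegressor.
Let's start by importing tree from the sklearn package; we will learn sklearn one piece at a time.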
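To make the distinction concrete, here is a minimal sketch that fits both kinds of tree on the same inputs. The tiny dataset and the variable names are made up for illustration; only DecisionTreeClassifier and DecisionTreeRegressor come from sklearn.

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Toy data, invented for this illustration: one feature, four samples.
X = [[0.0], [1.0], [2.0], [3.0]]
labels = ["low", "low", "high", "high"]   # discrete classes -> classifier
values = [0.1, 0.9, 2.1, 2.9]             # continuous targets -> regressor

clf = DecisionTreeClassifier(random_state=0).fit(X, labels)
reg = DecisionTreeRegressor(random_state=0).fit(X, values)

# The classifier returns one of the class labels it saw during training;
# the regressor returns a number (the mean target of the matching leaf).
pred_label = clf.predict([[2.8]])[0]
pred_value = reg.predict([[2.8]])[0]

print("classifier:", pred_label)
print("regressor:", pred_value)
```

The classifier can only ever answer with a label from the training set, while the regressor answers with a numeric value.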
from sklearn import tree
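If you want to read the built-in help text for the class, it is further down. My guess is that most people who land here from a search would rather not, so we will get the theory straight and then put it to use.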
clf=tree.DecisionTreeClassifier()
clf
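Let's take DecisionTreeClassifier() apart one piece at a time. Note that the class name is written in CamelCase: sklearn uses CamelCase for class names, while functions and parameters such as max_depth are lowercase with underscores.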
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
max_features=None, max_leaf_nodes=None,
min_impurity_decrease=0.0, min_impurity_split=None,
min_samples_leaf=1, min_samples_split=2,
min_weight_fraction_leaf=0.0, presort=False,
random_state=None, splitter='best')
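The exact list printed above depends on the installed sklearn version. A minimal sketch for checking the defaults on your own installation uses get_params(), which every sklearn estimator provides:

```python
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier()

# get_params() returns a dict mapping each constructor parameter to its value.
params = clf.get_params()

# Print the parameters in alphabetical order, one per line.
for name in sorted(params):
    print(f"{name} = {params[name]!r}")
```

Parameters that appear in the output above but not in yours have been removed from the version you are running.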
Suppose:
DecisionTreeClassifier(criterion='entropy', min_samples_leaf=3) creates a decision tree model. Its parameters have the following meanings:
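class_weight : Sets the weight of each class, mainly to keep the trained tree from leaning toward classes that are over-represented in the training set. You can specify the weight of each class yourself, or use "balanced" to let the algorithm compute the weights, in which case classes with fewer samples get higher weights.
criterion : "gini" or "entropy". The former measures split quality with Gini impurity, the latter with information entropy.
max_depth : int or None, optional (default=None). Sets the maximum depth of the decision tree. The deeper the tree, the more easily it overfits. A depth between 5 and 20 is suggested as a rule of thumb; the right value depends on the data.
max_features : None (all features), "log2", "sqrt", or a number N. With fewer than 50 features, using all of them is generally fine.
max_leaf_nodes : Limiting the maximum number of leaf nodes helps prevent overfitting. The default is None, meaning the number of leaf nodes is not limited.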
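min_impurity_decrease : A node is split only if the split decreases the impurity by at least this value. The default is 0.0.
random_state : The seed for the random number generator (an int, a RandomState instance, or None). Fixing it to an int makes the fitted tree reproducible.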
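min_impurity_split : This threshold limits the growth of the tree. If a node's impurity (Gini impurity, entropy, mean squared error, mean absolute error) is below the threshold, the node produces no children and becomes a leaf. It was deprecated in favor of min_impurity_decrease and has been removed from recent sklearn releases.
min_samples_leaf : The minimum number of samples a leaf node must contain. A split is only considered if it leaves at least this many training samples in each of the left and right branches; otherwise the split is not made and the would-be leaf stays merged with its sibling in the parent node.
min_samples_split : The minimum number of samples a node must contain to be split. A node with fewer samples than this is not split any further.
min_weight_fraction_leaf : The minimum fraction of the total sample weight that a leaf node must hold. A split that would leave a leaf below this fraction is not made. The default is 0, which means sample weights are not taken into account.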
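presort : Whether to presort the data to speed up the search for the best split. The default is False. This parameter was deprecated and has been removed from later sklearn releases.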
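splitter : "best" or "random". "best" picks the best split among all candidate split points; "random" picks the best among randomly drawn candidate splits. The default "best" suits datasets that are not too large, while "random" is recommended when the amount of sample data is very large.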
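The limits described above can be checked on a fitted tree. The sketch below fits the example model (criterion='entropy', min_samples_leaf=3) with a few extra limits and then reads the result back. The iris dataset and the specific values max_depth=4 and max_leaf_nodes=8 are assumptions chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Iris is used only as a convenient built-in dataset (not from the original post).
X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    criterion="entropy",   # split quality measured with information entropy
    min_samples_leaf=3,    # every leaf must keep at least 3 training samples
    max_depth=4,           # illustrative limit on tree depth
    max_leaf_nodes=8,      # illustrative limit on the number of leaves
    random_state=0,        # fixed seed so repeated runs give the same tree
)
clf.fit(X, y)

# The fitted structure lives in clf.tree_; a node is a leaf when it has
# no left child, which sklearn marks with -1 in children_left.
fitted = clf.tree_
is_leaf = fitted.children_left == -1
leaf_sizes = fitted.n_node_samples[is_leaf]

print("depth:", clf.get_depth())
print("number of leaves:", clf.get_n_leaves())
print("smallest leaf size:", leaf_sizes.min())
```

Loosening or removing one limit at a time and re-running shows how each parameter changes the shape of the tree.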
The help text follows. Personally, I'd rather not read it.
============================================================
Help on DecisionTreeClassifier in module sklearn.tree.tree object:
class DecisionTreeClassifier(BaseDecisionTree, sklearn.base.ClassifierMixin)
| DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)
|
| A decision tree classifier.
|
| Read more in the :ref:`User Guide <tree>`.
|
| Parameters
| ----------
| criterion : string, optional (default="gini")
| The function to measure the quality of a split. Supported criteria are
| "gini" for the Gini impurity and "entropy" for the information gain.
|
| splitter : string, optional (default="best")
| The strategy used to choose the split at each node. Supported
| strategies are "best" to choose the best split and "random" to choose
| the best random split.
|
| max_depth : int or None, optional (default=None)
| The maximum depth of the tree. If None, then nodes are expanded until
| all leaves are pure or until all leaves contain less than
| min_samples_split samples.
|
| min_samples_split : int, float, optional (default=2)
| The minimum number of samples required to split an internal node:
|
| - If int, then consider `min_samples_split` as the minimum number.
| - If float, then `min_samples_split` is a fraction and
| `ceil(min_samples_split * n_samples)` are the minimum
| number of samples for each split.
|
| .. versionchanged:: 0.18
| Added float values for fractions.
|
| min_samples_leaf : int, float, optional (default=1)
| The minimum number of samples required to be at a leaf node.
| A split point at any depth will only be considered if it leaves at
| least ``min_samples_leaf`` training samples in each of the left and
| right branches. This may have the effect of smoothing the model,
| especially in regression.
|
| - If int, then consider `min_samples_leaf` as the minimum number.
| - If float, then `min_samples_leaf` is a fraction and
| `ceil(min_samples_leaf * n_samples)` are the minimum
| number of samples for each node.
|
| .. versionchanged:: 0.18
| Added float values for fractions.
|
| min_weight_fraction_leaf : float, optional (default=0.)
| The minimum weighted fraction of the sum total of weights (of all
| the input samples) required to be at a leaf node. Samples have
| equal weight when sample_weight is not provided.
|
| max_features : int, float, string or None, optional (default=None)
| The number of features to consider when looking for the best split:
|
| - If int, then consider `max_features` features at each split.
| - If float, then `max_features` is a fraction and
| `int(max_features * n_features)` features are considered at each
| split.
| - If "auto", then `max_features=sqrt(n_features)`.
| - If "sqrt", then `max_features=sqrt(n_features)`.
| - If "log2", then `max_features=log2(n_features)`.
| - If None, then `max_features=n_features`.
|
| Note: the search for a split does not stop until at least one
| valid partition of the node samples is found, even if it requires to
| effectively inspect more than ``max_features`` features.
|
| random_state : int, RandomState instance or None, optional (default=None)
| If int, random_state is the seed used by the random number generator;
| If RandomState instance, random_state is the random number generator;
| If None, the random number generator is the RandomState instance used
| by `np.random`.
|
| max_leaf_nodes : int or None, optional (default=None)
| Grow a tree with ``max_leaf_nodes`` in best-first fashion.
| Best nodes are defined as relative reduction in impurity.
| If None then unlimited number of leaf nodes.
|