Welcome to pandas-ml’s documentation!¶
Contents:
What’s new¶
v0.6.0¶
Enhancement¶
- Support pandas v0.22.0 and scikit-learn 0.20.0.
API Change¶
- ModelFrame.model_selection.describe now returns a ModelFrame compatible with GridSearchCV.cv_results_.
Deprecation¶
- Drop support for pandas v0.18.x and earlier.
- Drop support for scikit-learn v0.18.x and earlier.
v0.4.0¶
Enhancement¶
- Support scikit-learn v0.17.x and v0.18.0.
- Support imbalanced-learn via the .imbalance accessor. See Handling imbalanced data.
- Added pandas_ml.ConfusionMatrix class for easier evaluation of classification results. See Confusion matrix.
Bug Fix¶
- ModelFrame.columns may not be preserved via .transform using FunctionTransformer, KernelCenterer, MaxAbsScaler and RobustScaler.
v0.3.1¶
Enhancement¶
- inverse_transform now reverts original ModelFrame.columns information.
Bug Fix¶
- Assigning Series to ModelFrame.data property raises TypeError.
v0.2.0¶
Enhancement¶
- ModelFrame.transform can preserve column names for some sklearn.preprocessing transformations.
- Added ModelSeries.fit, transform, fit_transform and inverse_transform for preprocessing purposes.
- ModelFrame can be initialized from statsmodels datasets.
- ModelFrame.cross_validation.iterate and ModelFrame.cross_validation.train_test_split now keep the index of the original dataset; added reset_index keyword to control this behaviour.
Bug Fix¶
- target kw may be ignored when initializing ModelFrame with np.ndarray and columns kwds.
- linear_model.enet_path doesn't accept additional keywords.
- Initializing ModelFrame with a named Series may result in duplicated target columns.
- ModelFrame.target_name may not be preserved when sliced.
v0.1.0¶
- Initial Release
Data Handling¶
Data Preparation¶
This section describes how to prepare the basic data format named ModelFrame. ModelFrame defines metadata to specify target (response variable) and data (explanatory variables / features). Using this metadata, ModelFrame can call other statistics/ML functions in a simpler way.
You can create a ModelFrame in the same manner as pandas.DataFrame. The example below shows how to create a basic ModelFrame, which DOESN'T have target values.
>>> import pandas_ml as pdml
>>> df = pdml.ModelFrame({'A': [1, 2, 3], 'B': [2, 3, 4],
... 'C': [3, 4, 5]}, index=['a', 'b', 'c'])
>>> df
A B C
a 1 2 3
b 2 3 4
c 3 4 5
>>> type(df)
<class 'pandas_ml.core.frame.ModelFrame'>
You can check whether the created ModelFrame has target values using the ModelFrame.has_target() method.
>>> df.has_target()
False
Target values can be specified via the target keyword. You can simply pass a column name to be handled as the target. The target column name can be confirmed via the target_name property.
>>> df2 = pdml.ModelFrame({'A': [1, 2, 3], 'B': [2, 3, 4],
... 'C': [3, 4, 5]}, target='A')
>>> df2
A B C
0 1 2 3
1 2 3 4
2 3 4 5
>>> df2.has_target()
True
>>> df2.target_name
'A'
Also, you can pass any list-like to be handled as a target. In this case, the target column will be named ".target".
>>> df3 = pdml.ModelFrame({'A': [1, 2, 3], 'B': [2, 3, 4],
... 'C': [3, 4, 5]}, target=[4, 5, 6])
>>> df3
.target A B C
0 4 1 2 3
1 5 2 3 4
2 6 3 4 5
>>> df3.has_target()
True
>>> df3.target_name
'.target'
Also, you can pass pandas.DataFrame and pandas.Series as data and target.
>>> import pandas as pd
>>> df4 = pdml.ModelFrame({'A': [1, 2, 3], 'B': [2, 3, 4],
... 'C': [3, 4, 5]}, target=pd.Series([4, 5, 6]))
>>> df4
.target A B C
0 4 1 2 3
1 5 2 3 4
2 6 3 4 5
>>> df4.has_target()
True
>>> df4.target_name
'.target'
Note
Target values are mandatory to perform operations which require a response variable, such as regression and supervised learning.
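For example, once a target is set, a supervised estimator can be fitted directly; a minimal sketch (output omitted):
>>> estimator = df2.linear_model.LinearRegression()
>>> df2.fit(estimator)
>>> predicted = df2.predict(estimator)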
Data Manipulation¶
You can manipulate a ModelFrame like a pandas.DataFrame. Because ModelFrame inherits from pandas.DataFrame, all pandas methods / functions can be applied to ModelFrame.
Sliced results will be ModelSeries (a simple wrapper for pandas.Series to support some data manipulation) or ModelFrame.
>>> df
A B C
a 1 2 3
b 2 3 4
c 3 4 5
>>> sliced = df['A']
>>> sliced
a 1
b 2
c 3
Name: A, dtype: int64
>>> type(sliced)
<class 'pandas_ml.core.series.ModelSeries'>
>>> subset = df[['A', 'B']]
>>> subset
A B
a 1 2
b 2 3
c 3 4
>>> type(subset)
<class 'pandas_ml.core.frame.ModelFrame'>
ModelFrame has special properties, data to access the data (features) and target to access the target.
>>> df2
A B C
0 1 2 3
1 2 3 4
2 3 4 5
>>> df2.target_name
'A'
>>> df2.data
B C
0 2 3
1 3 4
2 4 5
>>> df2.target
0 1
1 2
2 3
Name: A, dtype: int64
You can update data and target via these properties. Column / value assignment is also supported, just as in pandas.DataFrame.
>>> df2.target = [9, 9, 9]
>>> df2
A B C
0 9 2 3
1 9 3 4
2 9 4 5
>>> df2.data = pd.DataFrame({'X': [1, 2, 3], 'Y': [4, 5, 6]})
>>> df2
A X Y
0 9 1 4
1 9 2 5
2 9 3 6
>>> df2['X'] = [0, 0, 0]
>>> df2
A X Y
0 9 0 4
1 9 0 5
2 9 0 6
You can change the target column by specifying the target_name property.
>>> df2.target_name
'A'
>>> df2.target_name = 'X'
>>> df2.target_name
'X'
If the specified column doesn't exist in the ModelFrame, the target is reset to None, and the current target is regarded as data.
>>> df2.target_name
'X'
>>> df2.target_name = 'XXXX'
>>> df2.has_target()
False
>>> df2.data
A X Y
0 9 0 4
1 9 0 5
2 9 0 6
Use scikit-learn¶
This section describes how to use scikit-learn functionalities via pandas-ml.
Basics¶
You can create a ModelFrame instance from scikit-learn datasets directly.
>>> import pandas_ml as pdml
>>> import sklearn.datasets as datasets
>>> df = pdml.ModelFrame(datasets.load_iris())
>>> df.head()
.target sepal length (cm) sepal width (cm) petal length (cm) \
0 0 5.1 3.5 1.4
1 0 4.9 3.0 1.4
2 0 4.7 3.2 1.3
3 0 4.6 3.1 1.5
4 0 5.0 3.6 1.4
petal width (cm)
0 0.2
1 0.2
2 0.2
3 0.2
4 0.2
# make columns readable
>>> df.columns = ['.target', 'sepal length', 'sepal width', 'petal length', 'petal width']
ModelFrame has accessor methods which make it easier to access the scikit-learn namespace.
>>> df.cluster.KMeans
<class 'sklearn.cluster.k_means_.KMeans'>
The following table shows scikit-learn modules and the corresponding ModelFrame accessors. Some accessors have abbreviated aliases.
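| scikit-learn | ModelFrame accessor |
|---|---|
| sklearn.calibration | ModelFrame.calibration |
| sklearn.cluster | ModelFrame.cluster |
| sklearn.covariance | ModelFrame.covariance |
| sklearn.cross_decomposition | ModelFrame.cross_decomposition |
| sklearn.decomposition | ModelFrame.decomposition |
| sklearn.discriminant_analysis | ModelFrame.discriminant_analysis (ModelFrame.da) |
| sklearn.dummy | ModelFrame.dummy |
| sklearn.ensemble | ModelFrame.ensemble |
| sklearn.feature_extraction | ModelFrame.feature_extraction |
| sklearn.feature_selection | ModelFrame.feature_selection |
| sklearn.gaussian_process | ModelFrame.gaussian_process (ModelFrame.gp) |
| sklearn.isotonic | ModelFrame.isotonic |
| sklearn.kernel_approximation | ModelFrame.kernel_approximation |
| sklearn.kernel_ridge | ModelFrame.kernel_ridge |
| sklearn.linear_model | ModelFrame.linear_model (ModelFrame.lm) |
| sklearn.manifold | ModelFrame.manifold |
| sklearn.metrics | ModelFrame.metrics |
| sklearn.mixture | ModelFrame.mixture |
| sklearn.model_selection | ModelFrame.model_selection (ModelFrame.ms) |
| sklearn.multiclass | ModelFrame.multiclass |
| sklearn.multioutput | ModelFrame.multioutput |
| sklearn.naive_bayes | ModelFrame.naive_bayes |
| sklearn.neighbors | ModelFrame.neighbors |
| sklearn.neural_network | ModelFrame.neural_network |
| sklearn.pipeline | ModelFrame.pipeline |
| sklearn.preprocessing | ModelFrame.preprocessing (ModelFrame.pp) |
| sklearn.random_projection | ModelFrame.random_projection |
| sklearn.semi_supervised | ModelFrame.semi_supervised |
| sklearn.svm | ModelFrame.svm |
| sklearn.tree | ModelFrame.tree |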
Thus, you can instantiate each estimator via ModelFrame accessors. Once you create an estimator, you can pass it to ModelFrame.fit, then predict. ModelFrame automatically uses its data and target properties for each operation.
>>> estimator = df.cluster.KMeans(n_clusters=3)
>>> df.fit(estimator)
>>> predicted = df.predict(estimator)
>>> predicted
0 1
1 1
2 1
...
147 2
148 2
149 0
Length: 150, dtype: int32
ModelFrame preserves the most recently used estimator in the estimator attribute, and the predicted results in the predicted attribute.
>>> df.estimator
KMeans(copy_x=True, init='k-means++', max_iter=300, n_clusters=3, n_init=10,
n_jobs=1, precompute_distances=True, random_state=None, tol=0.0001,
verbose=0)
>>> df.predicted
0 1
1 1
2 1
...
147 2
148 2
149 0
Length: 150, dtype: int32
ModelFrame has the following methods corresponding to various scikit-learn estimators. The last results are saved as the corresponding ModelFrame properties.
| ModelFrame method | ModelFrame property |
|---|---|
| ModelFrame.fit | (None) |
| ModelFrame.transform | (None) |
| ModelFrame.fit_transform | (None) |
| ModelFrame.inverse_transform | (None) |
| ModelFrame.predict | ModelFrame.predicted |
| ModelFrame.fit_predict | ModelFrame.predicted |
| ModelFrame.score | (None) |
| ModelFrame.predict_proba | ModelFrame.proba |
| ModelFrame.predict_log_proba | ModelFrame.log_proba |
| ModelFrame.decision_function | ModelFrame.decision |
Note
If you access a property before calling the corresponding ModelFrame method, ModelFrame automatically calls the corresponding method of the latest estimator and returns the result.
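For example, accessing proba after fitting a probabilistic classifier triggers predict_proba on the latest estimator; a minimal sketch (output omitted):
>>> estimator = df.svm.SVC(probability=True)
>>> df.fit(estimator)
>>> df.proba  # predict_proba is called automatically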
The following example performs PCA, then reverts the principal components back to the original space. inverse_transform should revert the original columns.
>>> estimator = df.decomposition.PCA()
>>> df.fit(estimator)
>>> transformed = df.transform(estimator)
>>> transformed.head()
.target 0 1 2 3
0 0 -2.684207 -0.326607 0.021512 0.001006
1 0 -2.715391 0.169557 0.203521 0.099602
2 0 -2.889820 0.137346 -0.024709 0.019305
3 0 -2.746437 0.311124 -0.037672 -0.075955
4 0 -2.728593 -0.333925 -0.096230 -0.063129
>>> type(transformed)
<class 'pandas_ml.core.frame.ModelFrame'>
>>> transformed.inverse_transform(estimator)
.target sepal length sepal width petal length petal width
0 0 5.1 3.5 1.4 0.2
1 0 4.9 3.0 1.4 0.2
2 0 4.7 3.2 1.3 0.2
3 0 4.6 3.1 1.5 0.2
4 0 5.0 3.6 1.4 0.2
.. ... ... ... ... ...
145 2 6.7 3.0 5.2 2.3
146 2 6.3 2.5 5.0 1.9
147 2 6.5 3.0 5.2 2.0
148 2 6.2 3.4 5.4 2.3
149 2 5.9 3.0 5.1 1.8
[150 rows x 5 columns]
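Since all principal components were kept, the round trip recovers the original features up to floating point error; a quick sanity check (a sketch):
>>> import numpy as np
>>> restored = transformed.inverse_transform(estimator)
>>> np.allclose(df.data.values, restored.data.values)
True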
If a ModelFrame has both target and predicted values, model evaluation can be performed using functions available in ModelFrame.metrics.
>>> estimator = df.svm.SVC()
>>> df.fit(estimator)
>>> df.predict(estimator)
0 0
1 0
2 0
...
147 2
148 2
149 2
Length: 150, dtype: int64
>>> df.predicted
0 0
1 0
2 0
...
147 2
148 2
149 2
Length: 150, dtype: int64
>>> df.metrics.confusion_matrix()
Predicted 0 1 2
Target
0 50 0 0
1 0 48 2
2 0 0 50
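Other functions in ModelFrame.metrics use the same automatic mapping of target and predicted values; for example (a sketch, output omitted):
>>> df.metrics.f1_score(average='macro')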
Use Module Level Functions¶
Some scikit-learn modules define functions which handle data without instantiating estimators. You can call these functions from accessor methods directly, and ModelFrame will pass the corresponding data in the background. The following example uses the sklearn.cluster.k_means function to perform K-means clustering.
Important
When you use a module level function, ModelFrame.predicted WILL NOT be updated. Thus, using an estimator is recommended.
# no need to pass data explicitly
# sklearn.cluster.k_means returns centroids, cluster labels and inertia
>>> c, l, i = df.cluster.k_means(n_clusters=3)
>>> l
0 1
1 1
2 1
...
147 2
148 2
149 0
Length: 150, dtype: int32
Pipeline¶
ModelFrame can handle pipelines in the same manner as normal estimators.
>>> estimators = [('reduce_dim', df.decomposition.PCA()),
... ('svm', df.svm.SVC())]
>>> pipe = df.pipeline.Pipeline(estimators)
>>> df.fit(pipe)
>>> df.predict(pipe)
0 0
1 0
2 0
...
147 2
148 2
149 2
Length: 150, dtype: int64
The above expression is equivalent to the following:
>>> df2 = df.copy()
>>> df2 = df2.fit_transform(df2.decomposition.PCA())
>>> svm = df2.svm.SVC()
>>> df2.fit(svm)
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
>>> df2.predict(svm)
0 0
1 0
2 0
...
147 2
148 2
149 2
Length: 150, dtype: int64
Cross Validation¶
scikit-learn has some classes for cross validation. model_selection.train_test_split splits data into training and test sets. You can access the function via the model_selection accessor.
>>> train_df, test_df = df.model_selection.train_test_split()
>>> train_df
.target sepal length sepal width petal length petal width
124 2 6.7 3.3 5.7 2.1
117 2 7.7 3.8 6.7 2.2
123 2 6.3 2.7 4.9 1.8
65 1 6.7 3.1 4.4 1.4
133 2 6.3 2.8 5.1 1.5
.. ... ... ... ... ...
93 1 5.0 2.3 3.3 1.0
46 0 5.1 3.8 1.6 0.2
121 2 5.6 2.8 4.9 2.0
91 1 6.1 3.0 4.6 1.4
147 2 6.5 3.0 5.2 2.0
[112 rows x 5 columns]
>>> test_df
.target sepal length sepal width petal length petal width
146 2 6.3 2.5 5.0 1.9
75 1 6.6 3.0 4.4 1.4
138 2 6.0 3.0 4.8 1.8
77 1 6.7 3.0 5.0 1.7
36 0 5.5 3.5 1.3 0.2
.. ... ... ... ... ...
14 0 5.8 4.0 1.2 0.2
141 2 6.9 3.1 5.1 2.3
100 2 6.3 3.3 6.0 2.5
83 1 6.0 2.7 5.1 1.6
114 2 5.8 2.8 5.1 2.4
[38 rows x 5 columns]
You can iterate over splitter classes via ModelFrame.model_selection.split, which yields ModelFrames corresponding to the training and test data.
>>> kf = df.model_selection.KFold(n_splits=3)
>>> for train_df, test_df in df.model_selection.split(kf):
... print('training set shape: ', train_df.shape,
... 'test set shape: ', test_df.shape)
training set shape: (112, 5) test set shape: (38, 5)
training set shape: (112, 5) test set shape: (38, 5)
training set shape: (112, 5) test set shape: (38, 5)
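Cross validation scores can also be computed without manual iteration; ModelFrame.model_selection.cross_val_score maps data and target automatically (a sketch, output omitted):
>>> estimator = df.svm.SVC()
>>> df.model_selection.cross_val_score(estimator, cv=5)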
Grid Search¶
You can perform grid search using ModelFrame.fit.
>>> tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
... 'C': [1, 10, 100]},
... {'kernel': ['linear'], 'C': [1, 10, 100]}]
>>> df = pdml.ModelFrame(datasets.load_digits())
>>> cv = df.model_selection.GridSearchCV(df.svm.SVC(C=1), tuned_parameters,
... cv=5)
>>> df.fit(cv)
>>> cv.best_estimator_
SVC(C=10, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.001,
kernel='rbf', max_iter=-1, probability=False, random_state=None,
shrinking=True, tol=0.001, verbose=False)
In addition, ModelFrame.model_selection has a describe function which organizes each grid search result as a ModelFrame, accepting the fitted estimator.
>>> df.model_selection.describe(cv)
mean std C gamma kernel
0 0.974108 0.013139 1 0.0010 rbf
1 0.951416 0.020010 1 0.0001 rbf
2 0.975372 0.011280 10 0.0010 rbf
3 0.962534 0.020218 10 0.0001 rbf
4 0.975372 0.011280 100 0.0010 rbf
5 0.964695 0.016686 100 0.0001 rbf
6 0.951811 0.018410 1 NaN linear
7 0.951811 0.018410 10 NaN linear
8 0.951811 0.018410 100 NaN linear
Handling imbalanced data¶
This section describes how to use imbalanced-learn functionalities via pandas-ml to handle imbalanced data.
Sampling¶
Assume we have a ModelFrame which has imbalanced target values: the data contains 80 observations labeled 0 and 20 observations labeled 1.
>>> import numpy as np
>>> import pandas_ml as pdml
>>> df = pdml.ModelFrame(np.random.randn(100, 5),
... target=np.array([0, 1]).repeat([80, 20]),
... columns=list('ABCDE'))
>>> df
.target A B C D E
0 0 1.467859 1.637449 0.175770 0.189108 0.775139
1 0 -1.706293 -0.598930 -0.343427 0.355235 -1.348378
2 0 0.030542 0.393779 -1.891991 0.041062 0.055530
3 0 0.320321 -1.062963 -0.416418 -0.629776 1.126027
.. ... ... ... ... ... ...
96 1 -1.199039 0.055702 0.675555 -0.416601 -1.676259
97 1 -1.264182 -0.167390 -0.939794 -0.638733 -0.806794
98 1 -0.616754 1.667483 -1.858449 -0.259630 1.236777
99 1 -1.374068 -0.400435 -1.825555 0.824052 -0.335694
[100 rows x 6 columns]
>>> df.target.value_counts()
0 80
1 20
Name: .target, dtype: int64
You can access the imbalanced-learn namespace via the .imbalance accessor. Passing an instantiated under-sampling class to ModelFrame.fit_sample returns an under-sampled ModelFrame (note that .index is reset).
>>> sampler = df.imbalance.under_sampling.ClusterCentroids()
>>> sampler
ClusterCentroids(n_jobs=-1, random_state=None, ratio='auto')
>>> sampled = df.fit_sample(sampler)
>>> sampled
.target A B C D E
0 1 0.232841 -1.364282 1.436854 0.563796 -0.372866
1 1 -0.159551 0.473617 -2.024209 0.760444 -0.820403
2 1 1.495356 -2.144495 0.076485 1.219948 0.382995
3 1 -0.736887 1.399623 0.557098 0.621909 -0.507285
.. ... ... ... ... ... ...
36 0 0.429978 -1.421307 0.771368 1.704277 0.645590
37 0 1.408448 0.132760 -1.082301 -1.195149 0.155057
38 0 0.362793 -0.682171 1.026482 0.663343 -2.371229
39 0 -0.796293 -0.196428 -0.747574 2.228031 -0.468669
[40 rows x 6 columns]
>>> sampled.target.value_counts()
1 20
0 20
Name: .target, dtype: int64
In the same manner, you can perform over-sampling.
>>> sampler = df.imbalance.over_sampling.SMOTE()
>>> sampler
SMOTE(k=5, kind='regular', m=10, n_jobs=-1, out_step=0.5, random_state=None,
ratio='auto')
>>> sampled = df.fit_sample(sampler)
>>> sampled
.target A B C D E
0 0 1.467859 1.637449 0.175770 0.189108 0.775139
1 0 -1.706293 -0.598930 -0.343427 0.355235 -1.348378
2 0 0.030542 0.393779 -1.891991 0.041062 0.055530
3 0 0.320321 -1.062963 -0.416418 -0.629776 1.126027
.. ... ... ... ... ... ...
156 1 -1.279399 0.218171 -0.487836 -0.573564 0.582580
157 1 -0.736964 0.239095 -0.422025 -0.841780 0.221591
158 1 -0.273911 -0.305608 -0.886088 0.062414 -0.001241
159 1 0.073145 -0.167884 -0.781611 -0.016734 -0.045330
[160 rows x 6 columns]
>>> sampled.target.value_counts()
1 80
0 80
Name: .target, dtype: int64
The following table shows imbalanced-learn modules and the corresponding ModelFrame accessors.
| imbalanced-learn | ModelFrame accessor |
|---|---|
| imblearn.under_sampling | ModelFrame.imbalance.under_sampling |
| imblearn.over_sampling | ModelFrame.imbalance.over_sampling |
| imblearn.combine | ModelFrame.imbalance.combine |
| imblearn.ensemble | ModelFrame.imbalance.ensemble |
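Combined over- and under-sampling works in the same manner; a minimal sketch, assuming imblearn.combine.SMOTEENN is available in your imbalanced-learn version:
>>> sampler = df.imbalance.combine.SMOTEENN()
>>> sampled = df.fit_sample(sampler)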
Use XGBoost¶
This section describes how to use XGBoost functionalities via pandas-ml.
Use the scikit-learn digits dataset as sample data.
>>> import pandas_ml as pdml
>>> import sklearn.datasets as datasets
>>> df = pdml.ModelFrame(datasets.load_digits())
>>> df.head()
.target 0 1 2 ... 60 61 62 63
0 0 0 0 5 ... 10 0 0 0
1 1 0 0 0 ... 16 10 0 0
2 2 0 0 0 ... 11 16 9 0
3 3 0 0 7 ... 13 9 0 0
4 4 0 0 0 ... 16 4 0 0
[5 rows x 65 columns]
As estimators, XGBClassifier and XGBRegressor are available via the xgboost accessor. See the XGBoost Scikit-learn API for details.
>>> df.xgboost.XGBClassifier
<class 'xgboost.sklearn.XGBClassifier'>
>>> df.xgboost.XGBRegressor
<class 'xgboost.sklearn.XGBRegressor'>
You can use these estimators like scikit-learn estimators.
>>> train_df, test_df = df.model_selection.train_test_split()
>>> estimator = df.xgboost.XGBClassifier()
>>> train_df.fit(estimator)
XGBClassifier(base_score=0.5, colsample_bytree=1, gamma=0, learning_rate=0.1,
max_delta_step=0, max_depth=3, min_child_weight=1, missing=None,
n_estimators=100, nthread=-1, objective='multi:softprob', seed=0,
silent=True, subsample=1)
>>> predicted = test_df.predict(estimator)
>>> predicted
1371 2
1090 3
1299 2
...
1286 8
1632 3
538 2
dtype: int64
>>> test_df.metrics.confusion_matrix()
Predicted 0 1 2 3 ... 6 7 8 9
Target ...
0 53 0 0 0 ... 0 0 1 0
1 0 46 0 0 ... 0 0 0 0
2 0 0 51 1 ... 0 0 1 0
3 0 0 0 33 ... 0 0 1 0
4 0 0 0 0 ... 0 0 0 1
5 0 0 0 0 ... 1 0 0 1
6 0 0 0 0 ... 39 0 1 0
7 0 0 0 0 ... 0 40 0 1
8 1 0 0 0 ... 1 0 32 2
9 0 1 0 0 ... 0 1 1 51
[10 rows x 10 columns]
Also, plotting functions are available via the xgboost accessor.
>>> train_df.xgboost.plot_importance()
# importance plot will be displayed
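Tree visualization is exposed the same way; a minimal sketch using to_graphviz (assumes the graphviz package is installed):
>>> g = train_df.xgboost.to_graphviz(num_trees=0)
>>> g.render('tree')  # writes the diagram to a file; IPython renders g inline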
XGBoost estimators can be passed to other scikit-learn APIs. The following example performs a grid search.
>>> tuned_parameters = [{'max_depth': [3, 4]}]
>>> cv = df.model_selection.GridSearchCV(df.xgb.XGBClassifier(), tuned_parameters, cv=5)
>>> df.fit(cv)
>>> df.model_selection.describe(cv)
mean std max_depth
0 0.917641 0.032600 3
1 0.919310 0.026644 4
Use patsy¶
This section describes data transformation using patsy. ModelFrame.transform can accept a patsy-style formula.
>>> import pandas_ml as pdml
# create modelframe which doesn't have target
>>> df = pdml.ModelFrame({'X': [1, 2, 3], 'Y': [2, 3, 4],
... 'Z': [3, 4, 5]}, index=['a', 'b', 'c'])
>>> df
X Y Z
a 1 2 3
b 2 3 4
c 3 4 5
# transform with patsy formula
>>> transformed = df.transform('Z ~ Y + X')
>>> transformed
Z Intercept Y X
a 3 1 2 1
b 4 1 3 2
c 5 1 4 3
# transformed data should have target specified by formula
>>> transformed.target
a 3
b 4
c 5
Name: Z, dtype: float64
>>> transformed.data
Intercept Y X
a 1 2 1
b 1 3 2
c 1 4 3
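Because the transformed ModelFrame carries both target and data, it can be fitted directly; a minimal sketch (fit_intercept=False because the formula already adds an Intercept column):
>>> estimator = transformed.lm.LinearRegression(fit_intercept=False)
>>> transformed.fit(estimator)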
If you do not want an intercept, include 0 in the formula.
>>> df.transform('Z ~ Y + 0')
Z Y
a 3 2
b 4 3
c 5 4
Also, you can use a formula which doesn't have a left side.
# create modelframe which has target
>>> df2 = pdml.ModelFrame({'X': [1, 2, 3], 'Y': [2, 3, 4],'Z': [3, 4, 5]},
...                     target=[7, 8, 9], index=['a', 'b', 'c'])
>>> df2
.target X Y Z
a 7 1 2 3
b 8 2 3 4
c 9 3 4 5
# overwrite data with transformed data
>>> df2.data = df2.transform('Y + Z')
>>> df2
.target Intercept Y Z
a 7 1 2 3
b 8 1 3 4
c 9 1 4 5
# data has been updated based on formula
>>> df2.data
Intercept Y Z
a 1 2 3
b 1 3 4
c 1 4 5
# target is not changed
>>> df2.target
a 7
b 8
c 9
Name: .target, dtype: int64
The example below performs deviation coding via a patsy formula.
>>> df3 = pdml.ModelFrame({'X': [1, 2, 3, 4, 5], 'Y': [1, 3, 2, 2, 1],
... 'Z': [1, 1, 1, 2, 2]}, target='Z',
... index=['a', 'b', 'c', 'd', 'e'])
>>> df3
X Y Z
a 1 1 1
b 2 3 1
c 3 2 1
d 4 2 2
e 5 1 2
>>> df3.transform('C(X, Sum)')
Intercept C(X, Sum)[S.1] C(X, Sum)[S.2] C(X, Sum)[S.3] C(X, Sum)[S.4]
a 1 1 0 0 0
b 1 0 1 0 0
c 1 0 0 1 0
d 1 0 0 0 1
e 1 -1 -1 -1 -1
>>> df3.transform('C(Y, Sum)')
Intercept C(Y, Sum)[S.1] C(Y, Sum)[S.2]
a 1 1 0
b 1 -1 -1
c 1 0 1
d 1 0 1
e 1 1 0
Confusion matrix¶
Import ConfusionMatrix
from pandas_ml import ConfusionMatrix
Define actual values (y_true) and predicted values (y_pred)
y_true = ['rabbit', 'cat', 'rabbit', 'rabbit', 'cat', 'dog', 'dog', 'rabbit', 'rabbit', 'cat', 'dog', 'rabbit']
y_pred = ['cat', 'cat', 'rabbit', 'dog', 'cat', 'rabbit', 'dog', 'cat', 'rabbit', 'cat', 'rabbit', 'rabbit']
Let's define a (non-binary) confusion matrix
confusion_matrix = ConfusionMatrix(y_true, y_pred)
print("Confusion matrix:\n%s" % confusion_matrix)
The output is:
Predicted cat dog rabbit __all__
Actual
cat 3 0 0 3
dog 0 1 2 3
rabbit 2 1 3 6
__all__ 5 2 5 12
Matplotlib plot of a confusion matrix¶
Inside an IPython notebook, add this line as the first cell:
%matplotlib inline
You can plot the confusion matrix using:
import matplotlib.pyplot as plt
confusion_matrix.plot()
If you are not using inline mode, you need to call plt.show() to display the confusion matrix plot.
plt.show()
[figure: confusion matrix plot]
Matplotlib plot of a normalized confusion matrix¶
confusion_matrix.plot(normalized=True)
plt.show()
[figure: normalized confusion matrix plot]
Binary confusion matrix¶
If actual values (y_true) and predicted values (y_pred) are bool, ConfusionMatrix outputs a binary confusion matrix.
y_true = [ True, True, False, False, False, True, False, True, True,
False, True, False, False, False, False, False, True, False,
True, True, True, True, False, False, False, True, False,
True, False, False, False, False, True, True, False, False,
False, True, True, True, True, False, False, False, False,
True, False, False, False, False, False, False, False, False,
False, True, True, False, True, False, True, True, True,
False, False, True, False, True, False, False, True, False,
False, False, False, False, False, False, False, True, False,
True, True, True, True, False, False, True, False, True,
True, False, True, False, True, False, False, True, True,
False, False, True, True, False, False, False, False, False,
False, True, True, False]
y_pred = [False, False, False, False, False, True, False, False, True,
False, True, False, False, False, False, False, False, False,
True, True, True, True, False, False, False, False, False,
False, False, False, False, False, True, False, False, False,
False, True, False, False, False, False, False, False, False,
True, False, False, False, False, False, False, False, False,
False, True, False, False, False, False, False, False, False,
False, False, True, False, False, False, False, True, False,
False, False, False, False, False, False, False, True, False,
False, True, False, False, False, False, True, False, True,
True, False, False, False, True, False, False, True, True,
False, False, True, True, False, False, False, False, False,
False, True, False, False]
binary_confusion_matrix = ConfusionMatrix(y_true, y_pred)
print("Binary confusion matrix:\n%s" % binary_confusion_matrix)
It displays as a nicely labeled pandas DataFrame:
Binary confusion matrix:
Predicted False True __all__
Actual
False 67 0 67
True 21 24 45
__all__ 88 24 112
You can get useful attributes such as True Positive (TP), True Negative (TN) …
print(binary_confusion_matrix.TP)
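TN, FP and FN follow the same attribute naming; the values below are read off the matrix above (a sketch, assuming these attributes mirror TP):
print(binary_confusion_matrix.TN)  # 67
print(binary_confusion_matrix.FP)  # 0
print(binary_confusion_matrix.FN)  # 21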
Matplotlib plot of a binary confusion matrix¶
binary_confusion_matrix.plot()
plt.show()
[figure: binary confusion matrix plot]
Matplotlib plot of a normalized binary confusion matrix¶
binary_confusion_matrix.plot(normalized=True)
plt.show()
[figure: normalized binary confusion matrix plot]
Seaborn plot of a binary confusion matrix (ToDo)¶
binary_confusion_matrix.plot(backend='seaborn')
Confusion matrix and class statistics¶
Overall statistics and class statistics of the confusion matrix can be easily displayed.
y_true = [600, 200, 200, 200, 200, 200, 200, 200, 500, 500, 500, 200, 200, 200, 200, 200, 200, 200, 200, 200]
y_pred = [100, 200, 200, 100, 100, 200, 200, 200, 100, 200, 500, 100, 100, 100, 100, 100, 100, 100, 500, 200]
cm = ConfusionMatrix(y_true, y_pred)
cm.print_stats()
You should get:
Confusion Matrix:
Classes 100 200 500 600 __all__
Actual
100 0 0 0 0 0
200 9 6 1 0 16
500 1 1 1 0 3
600 1 0 0 0 1
__all__ 11 7 2 0 20
Overall Statistics:
Accuracy: 0.35
95% CI: (0.1539092047845412, 0.59218853453282805)
No Information Rate: ToDo
P-Value [Acc > NIR]: 0.978585644357
Kappa: 0.0780141843972
Mcnemar's Test P-Value: ToDo
Class Statistics:
Classes 100 200 500 600
Population 20 20 20 20
Condition positive 0 16 3 1
Condition negative 20 4 17 19
Test outcome positive 11 7 2 0
Test outcome negative 9 13 18 20
TP: True Positive 0 6 1 0
TN: True Negative 9 3 16 19
FP: False Positive 11 1 1 0
FN: False Negative 0 10 2 1
TPR: Sensivity NaN 0.375 0.3333333 0
TNR=SPC: Specificity 0.45 0.75 0.9411765 1
PPV: Pos Pred Value = Precision 0 0.8571429 0.5 NaN
NPV: Neg Pred Value 1 0.2307692 0.8888889 0.95
FPR: False-out 0.55 0.25 0.05882353 0
FDR: False Discovery Rate 1 0.1428571 0.5 NaN
FNR: Miss Rate NaN 0.625 0.6666667 1
ACC: Accuracy 0.45 0.45 0.85 0.95
F1 score 0 0.5217391 0.4 0
MCC: Matthews correlation coefficient NaN 0.1048285 0.326732 NaN
Informedness NaN 0.125 0.2745098 0
Markedness 0 0.08791209 0.3888889 NaN
Prevalence 0 0.8 0.15 0.05
LR+: Positive likelihood ratio NaN 1.5 5.666667 NaN
LR-: Negative likelihood ratio NaN 0.8333333 0.7083333 1
DOR: Diagnostic odds ratio NaN 1.8 8 NaN
FOR: False omission rate 0 0.7692308 0.1111111 0.05
Statistics are also available as an OrderedDict using:
cm.stats()
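The returned OrderedDict can be inspected programmatically; a minimal sketch (the exact key names depend on the pandas_ml version, so check them before relying on specific entries):
stats = cm.stats()
print(list(stats.keys()))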
API:
pandas_ml.core package¶
Submodules¶
-
class
pandas_ml.core.frame.ModelFrame(data, target=None, *args, **kwargs)¶ Bases:
pandas_ml.core.generic.ModelPredictor, pandas.core.frame.DataFrame
Data structure subclassing pandas.DataFrame to define metadata to specify target (response variable) and data (explanatory variables / features).
Parameters:
- data : same as pandas.DataFrame
- target : str or array-like
Column name or values to be used as target
- args : arguments passed to pandas.DataFrame
- kwargs : keyword arguments passed to pandas.DataFrame
-
calibration¶ Property to access
sklearn.calibration
-
cluster¶ Property to access
sklearn.cluster. See pandas_ml.skaccessors.cluster
-
covariance¶ Property to access
sklearn.covariance. See pandas_ml.skaccessors.covariance
-
cross_decomposition¶ Property to access
sklearn.cross_decomposition
-
da¶ Property to access
sklearn.discriminant_analysis
-
data¶ Return data (explanatory variable / features)
Returns: - data : ModelFrame
-
decision_function(estimator, *args, **kwargs)¶ Call estimator’s decision_function method.
Parameters: - args : arguments passed to decision_function method
- kwargs : keyword arguments passed to decision_function method
Returns: - returned : decisions
-
decomposition¶ Property to access
sklearn.decomposition
-
discriminant_analysis¶ Property to access
sklearn.discriminant_analysis
-
dummy¶ Property to access
sklearn.dummy
-
ensemble¶ Property to access
sklearn.ensemble. See pandas_ml.skaccessors.ensemble
-
feature_extraction¶ Property to access
sklearn.feature_extraction. See pandas_ml.skaccessors.feature_extraction
-
feature_selection¶ Property to access
sklearn.feature_selection. See pandas_ml.skaccessors.feature_selection
-
fit_predict(estimator, *args, **kwargs)¶ Call estimator’s fit_predict method.
Parameters: - args : arguments passed to fit_predict method
- kwargs : keyword arguments passed to fit_predict method
Returns: - returned : predicted result
-
fit_resample(estimator, *args, **kwargs)¶ Call estimator’s fit_resample method.
Parameters: - args : arguments passed to fit_resample method
- kwargs : keyword arguments passed to fit_resample method
Returns: - returned : resampling result
-
fit_sample(estimator, *args, **kwargs)¶ Call estimator’s fit_sample method.
Parameters: - args : arguments passed to fit_sample method
- kwargs : keyword arguments passed to fit_sample method
Returns: - returned : sampling result
-
fit_transform(estimator, *args, **kwargs)¶ Call estimator’s fit_transform method.
Parameters: - args : arguments passed to fit_transform method
- kwargs : keyword arguments passed to fit_transform method
Returns: - returned : transformed result
-
gaussian_process¶ Property to access
sklearn.gaussian_process. See pandas_ml.skaccessors.gaussian_process
-
gp¶ Property to access
sklearn.gaussian_process. See pandas_ml.skaccessors.gaussian_process
-
groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False)¶ Group DataFrame or Series using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
Parameters: - by : mapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it's called on each value of the object's index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series' values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
- axis : {0 or 'index', 1 or 'columns'}, default 0
Split along rows (0) or columns (1).
- level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
- as_index : bool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
- sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
- group_keys : bool, default True
When calling apply, add group keys to index to identify pieces.
- squeeze : bool, default False
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
- observed : bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
New in version 0.23.0.
- **kwargs
Optional, only accepts keyword argument ‘mutated’ and is passed to groupby.
Returns: - DataFrameGroupBy or SeriesGroupBy
Depends on the calling object and returns groupby object that contains information about the groups.
See also
resample : Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more.
Examples
>>> df = pd.DataFrame({'Animal' : ['Falcon', 'Falcon',
...                                'Parrot', 'Parrot'],
...                    'Max Speed' : [380., 370., 24., 26.]})
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Capitve', 'Wild', 'Capitve', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = pd.DataFrame({'Max Speed' : [390., 350., 30., 20.]},
...                   index=index)
>>> df
                Max Speed
Animal Type
Falcon Capitve      390.0
       Wild         350.0
Parrot Capitve       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level=1).mean()
         Max Speed
Type
Capitve      210.0
Wild         185.0
-
has_data()¶ Return whether
ModelFrame has data
Returns: - has_data : bool
-
has_multi_targets()¶ Return whether
ModelFrame has multiple target columns
Returns: - has_multi_targets : bool
-
has_target()¶ Return whether
ModelFrame has target
Returns: - has_target : bool
-
imbalance¶ Property to access
imblearn
-
inverse_transform(estimator, *args, **kwargs)¶ Call estimator’s inverse_transform method.
Parameters: - args : arguments passed to inverse_transform method
- kwargs : keyword arguments passed to inverse_transform method
Returns: - returned : transformed result
-
isotonic¶ Property to access
sklearn.isotonic. See pandas_ml.skaccessors.isotonic
-
kernel_approximation¶ Property to access
sklearn.kernel_approximation
-
kernel_ridge¶ Property to access
sklearn.kernel_ridge
-
lda¶ Property to access
sklearn.lda
-
linear_model¶ Property to access
sklearn.linear_model. See pandas_ml.skaccessors.linear_model
-
lm¶ Property to access
sklearn.linear_model. See pandas_ml.skaccessors.linear_model
-
manifold¶ Property to access
sklearn.manifold. See pandas_ml.skaccessors.manifold
-
metrics¶ Property to access
sklearn.metrics. See pandas_ml.skaccessors.metrics
-
mixture¶ Property to access
sklearn.mixture
-
model_selection¶ Property to access
sklearn.model_selection. See pandas_ml.skaccessors.model_selection
-
ms¶ Property to access
sklearn.model_selection. See pandas_ml.skaccessors.model_selection
-
multiclass¶ Property to access
sklearn.multiclass. See pandas_ml.skaccessors.multiclass
-
multioutput¶ Property to access
sklearn.multioutput. See pandas_ml.skaccessors.multioutput
-
naive_bayes¶ Property to access
sklearn.naive_bayes
-
neighbors¶ Property to access
sklearn.neighbors. See pandas_ml.skaccessors.neighbors
-
neural_network¶ Property to access
sklearn.neural_network
-
pipeline¶ Property to access
sklearn.pipeline. See pandas_ml.skaccessors.pipeline
-
pp¶ Property to access
sklearn.preprocessing. See pandas_ml.skaccessors.preprocessing
-
predict_log_proba(estimator, *args, **kwargs)¶ Call estimator’s predict_log_proba method.
Parameters: - args : arguments passed to predict_log_proba method
- kwargs : keyword arguments passed to predict_log_proba method
Returns: - returned : probabilities
-
predict_proba(estimator, *args, **kwargs)¶ Call estimator’s predict_proba method.
Parameters: - args : arguments passed to predict_proba method
- kwargs : keyword arguments passed to predict_proba method
Returns: - returned : probabilities
-
preprocessing¶ Property to access
sklearn.preprocessing. See pandas_ml.skaccessors.preprocessing
-
qda¶ Property to access
sklearn.qda
-
random_projection¶ Property to access
sklearn.random_projection. See pandas_ml.skaccessors.random_projection
-
sample(estimator, *args, **kwargs)¶ Call estimator’s sample method.
Parameters: - args : arguments passed to sample method
- kwargs : keyword arguments passed to sample method
Returns: - returned : sampling result
-
score(estimator, *args, **kwargs)¶ Call estimator’s score method.
Parameters: - args : arguments passed to score method
- kwargs : keyword arguments passed to score method
Returns: - returned : score
-
seaborn¶ Property to access
seaborn API
-
semi_supervised¶ Property to access
sklearn.semi_supervised. See pandas_ml.skaccessors.semi_supervised
-
sns¶ Property to access
seaborn API
-
svm¶ Property to access
sklearn.svm. See pandas_ml.skaccessors.svm
-
target¶ Return target (response variable)
Returns: - target : ModelSeries
-
target_name¶ Return target column name
Returns: - target : object
-
transform(estimator, *args, **kwargs)¶ Call estimator’s transform method.
Parameters: - args : arguments passed to transform method
- kwargs : keyword arguments passed to transform method
Returns: - returned : transformed result
-
tree¶ Property to access
sklearn.tree
-
xgb¶ Property to access
xgboost.sklearn API
-
xgboost¶ Property to access
xgboost.sklearn API
-
class
pandas_ml.core.generic.ModelPredictor¶ Bases:
pandas_ml.core.generic.ModelTransformer
Base class for ModelFrame and ModelFrameGroupBy
-
decision¶ Return current estimator’s decision function
Returns: - decisions : ModelFrame
-
estimator¶ Return most recently used estimator
Returns: - estimator : estimator
-
log_proba¶ Return current estimator’s log probabilities
Returns: - probabilities : ModelFrame
-
predict(estimator, *args, **kwargs)¶ Call estimator’s predict method.
Parameters: - args : arguments passed to predict method
- kwargs : keyword arguments passed to predict method
Returns: - returned : predicted result
-
predicted¶ Return current estimator’s predicted results
Returns: - predicted : ModelSeries
-
proba¶ Return current estimator’s probabilities
Returns: - probabilities : ModelFrame
-
-
class
pandas_ml.core.generic.ModelTransformer¶ Bases:
object
Base class for ModelFrame and ModelSeries
-
fit(estimator, *args, **kwargs)¶ Call estimator’s fit method.
Parameters: - args : arguments passed to fit method
- kwargs : keyword arguments passed to fit method
Returns: - returned : None or fitted estimator
-
fit_transform(estimator, *args, **kwargs)¶ Call estimator’s fit_transform method.
Parameters: - args : arguments passed to fit_transform method
- kwargs : keyword arguments passed to fit_transform method
Returns: - returned : transformed result
-
inverse_transform(estimator, *args, **kwargs)¶ Call estimator’s inverse_transform method.
Parameters: - args : arguments passed to inverse_transform method
- kwargs : keyword arguments passed to inverse_transform method
Returns: - returned : transformed result
-
transform(estimator, *args, **kwargs)¶ Call estimator’s transform method.
Parameters: - args : arguments passed to transform method
- kwargs : keyword arguments passed to transform method
Returns: - returned : transformed result
-
-
class
pandas_ml.core.groupby.GroupedEstimator(estimator, grouped)¶ Bases:
pandas_ml.core.base._BaseEstimator
Create grouped estimators based on passed estimator
-
class
pandas_ml.core.groupby.ModelFrameGroupBy(obj, keys=None, axis=0, level=None, grouper=None, exclusions=None, selection=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)¶ Bases:
pandas.core.groupby.generic.DataFrameGroupBy, pandas_ml.core.generic.ModelPredictor
-
transform(func, *args, **kwargs)¶ Call estimator’s transform method.
Parameters: - args : arguments passed to transform method
- kwargs : keyword arguments passed to transform method
Returns: - returned : transformed result
-
-
class
pandas_ml.core.groupby.ModelSeriesGroupBy(obj, keys=None, axis=0, level=None, grouper=None, exclusions=None, selection=None, as_index=True, sort=True, group_keys=True, squeeze=False, observed=False, **kwargs)¶ Bases:
pandas.core.groupby.generic.SeriesGroupBy
-
pandas_ml.core.groupby.groupby(obj, by, **kwds)¶ Class for grouping and aggregating relational data.
See aggregate, transform, and apply functions on this object.
It’s easiest to use obj.groupby(…) to use GroupBy, but you can also do:
grouped = groupby(obj, ...)
Parameters: - obj : pandas object
- axis : int, default 0
- level : int, default None
Level of MultiIndex
- groupings : list of Grouping objects
Most users should ignore this
- exclusions : array-like, optional
List of columns to exclude
- name : string
Most users should ignore this
Returns: - **Attributes**
- groups : dict
{group name -> group labels}
- len(grouped) : int
Number of groups
Notes
After grouping, see aggregate, apply, and transform functions. Here are some other brief notes about usage. When grouping by multiple groups, the result index will be a MultiIndex (hierarchical) by default.
Iteration produces (key, group) tuples, i.e. chunking the data by group. So you can write code like:
grouped = obj.groupby(keys, axis=axis)
for key, group in grouped:
    # do something with the data
Function calls on GroupBy, if not specially implemented, “dispatch” to the grouped data. So if you group a DataFrame and wish to invoke the std() method on each group, you can simply do:
df.groupby(mapper).std()
rather than
df.groupby(mapper).aggregate(np.std)
You can pass arguments to these “wrapped” functions, too.
See the online documentation for full exposition on these topics and much more
-
class
pandas_ml.core.series.ModelSeries(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)¶ Bases:
pandas_ml.core.generic.ModelTransformer, pandas.core.series.Series
Wrapper for pandas.Series to support sklearn.preprocessing
-
groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, squeeze=False)¶ Group DataFrame or Series using a mapper or by a Series of columns.
A groupby operation involves some combination of splitting the object, applying a function, and combining the results. This can be used to group large amounts of data and compute operations on these groups.
Parameters: - by : mapping, function, label, or list of labels
Used to determine the groups for the groupby. If by is a function, it's called on each value of the object's index. If a dict or Series is passed, the Series or dict VALUES will be used to determine the groups (the Series' values are first aligned; see .align() method). If an ndarray is passed, the values are used as-is to determine the groups. A label or list of labels may be passed to group by the columns in self. Notice that a tuple is interpreted as a (single) key.
- axis : {0 or 'index', 1 or 'columns'}, default 0
Split along rows (0) or columns (1).
- level : int, level name, or sequence of such, default None
If the axis is a MultiIndex (hierarchical), group by a particular level or levels.
- as_index : bool, default True
For aggregated output, return object with group labels as the index. Only relevant for DataFrame input. as_index=False is effectively “SQL-style” grouped output.
- sort : bool, default True
Sort group keys. Get better performance by turning this off. Note this does not influence the order of observations within each group. Groupby preserves the order of rows within each group.
- group_keys : bool, default True
When calling apply, add group keys to index to identify pieces.
- squeeze : bool, default False
Reduce the dimensionality of the return type if possible, otherwise return a consistent type.
- observed : bool, default False
This only applies if any of the groupers are Categoricals. If True: only show observed values for categorical groupers. If False: show all values for categorical groupers.
New in version 0.23.0.
- **kwargs
Optional, only accepts keyword argument ‘mutated’ and is passed to groupby.
Returns: - DataFrameGroupBy or SeriesGroupBy
Depends on the calling object and returns groupby object that contains information about the groups.
See also
resample : Convenience method for frequency conversion and resampling of time series.
Notes
See the user guide for more.
Examples
>>> df = pd.DataFrame({'Animal' : ['Falcon', 'Falcon',
...                                'Parrot', 'Parrot'],
...                    'Max Speed' : [380., 370., 24., 26.]})
>>> df
   Animal  Max Speed
0  Falcon      380.0
1  Falcon      370.0
2  Parrot       24.0
3  Parrot       26.0
>>> df.groupby(['Animal']).mean()
        Max Speed
Animal
Falcon      375.0
Parrot       25.0
Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...           ['Capitve', 'Wild', 'Capitve', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> df = pd.DataFrame({'Max Speed' : [390., 350., 30., 20.]},
...                   index=index)
>>> df
                Max Speed
Animal Type
Falcon Capitve      390.0
       Wild         350.0
Parrot Capitve       30.0
       Wild          20.0
>>> df.groupby(level=0).mean()
        Max Speed
Animal
Falcon      370.0
Parrot       25.0
>>> df.groupby(level=1).mean()
         Max Speed
Type
Capitve      210.0
Wild         185.0
-
pp¶ Property to access
sklearn.preprocessing. See pandas_ml.skaccessors.preprocessing
-
preprocessing¶ Property to access
sklearn.preprocessing. See pandas_ml.skaccessors.preprocessing
-
to_frame(name=None)¶ Convert Series to DataFrame.
Parameters: - name : object, default None
The passed name should substitute for the series name (if it has one).
Returns: - data_frame : DataFrame
-
transform(estimator, *args, **kwargs)¶ Call estimator’s transform method.
Parameters: - args : arguments passed to transform method
- kwargs : keyword arguments passed to transform method
Returns: - returned : transformed result
-
Module contents¶
pandas_ml.skaccessors package¶
Subpackages¶
Submodules¶
-
class
pandas_ml.skaccessors.cluster.ClusterMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.cluster.
-
affinity_propagation(*args, **kwargs)¶ Call
sklearn.cluster.affinity_propagation using automatic mapping. S: ModelFrame.data
-
bicluster¶ Property to access
sklearn.cluster.bicluster
-
dbscan(*args, **kwargs)¶ Call
sklearn.cluster.dbscan using automatic mapping. X: ModelFrame.data
-
k_means(n_clusters, *args, **kwargs)¶ Call
sklearn.cluster.k_means using automatic mapping. X: ModelFrame.data
-
mean_shift(*args, **kwargs)¶ Call
sklearn.cluster.mean_shift using automatic mapping. X: ModelFrame.data
-
spectral_clustering(*args, **kwargs)¶ Call
sklearn.cluster.spectral_clustering using automatic mapping. affinity: ModelFrame.data
-
-
class
pandas_ml.skaccessors.covariance.CovarianceMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.covariance.
-
empirical_covariance(*args, **kwargs)¶ Call
sklearn.covariance.empirical_covariance using automatic mapping. X: ModelFrame.data
-
ledoit_wolf(*args, **kwargs)¶ Call
sklearn.covariance.ledoit_wolf using automatic mapping. X: ModelFrame.data
-
oas(*args, **kwargs)¶ Call
sklearn.covariance.oas using automatic mapping. X: ModelFrame.data
-
-
class
pandas_ml.skaccessors.cross_decomposition.CrossDecompositionMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.cross_decomposition.
-
class
pandas_ml.skaccessors.decomposition.DecompositionMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.decomposition.
-
dict_learning(n_components, alpha, *args, **kwargs)¶ Call
sklearn.decomposition.dict_learning using automatic mapping. X: ModelFrame.data
-
dict_learning_online(*args, **kwargs)¶ Call
sklearn.decomposition.dict_learning_online using automatic mapping. X: ModelFrame.data
-
fastica(*args, **kwargs)¶ Call
sklearn.decomposition.fastica using automatic mapping. X: ModelFrame.data
-
sparse_encode(dictionary, *args, **kwargs)¶ Call
sklearn.decomposition.sparse_encode using automatic mapping. X: ModelFrame.data
-
-
class
pandas_ml.skaccessors.ensemble.EnsembleMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.ensemble.
-
partial_dependence¶ Property to access
sklearn.ensemble.partial_dependence
-
-
class
pandas_ml.skaccessors.ensemble.PartialDependenceMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
-
partial_dependence(gbrt, target_variables, **kwargs)¶ Call
sklearn.ensemble.partial_dependence using automatic mapping. X: ModelFrame.data
-
plot_partial_dependence(gbrt, features, **kwargs)¶ Call
sklearn.ensemble.plot_partial_dependence using automatic mapping. X: ModelFrame.data
-
-
class
pandas_ml.skaccessors.feature_extraction.FeatureExtractionMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.feature_extraction.
-
image¶ Property to access
sklearn.feature_extraction.image
-
text¶ Property to access
sklearn.feature_extraction.text
-
-
class
pandas_ml.skaccessors.feature_selection.FeatureSelectionMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.feature_selection.
-
class
pandas_ml.skaccessors.gaussian_process.GaussianProcessMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.gaussian_process.
-
correlation_models¶ Property to access
sklearn.gaussian_process.correlation_models
-
regression_models¶ Property to access
sklearn.gaussian_process.regression_models
-
-
class
pandas_ml.skaccessors.gaussian_process.RegressionModelsMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
-
class
pandas_ml.skaccessors.isotonic.IsotonicMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.isotonic.
-
IsotonicRegression¶ sklearn.isotonic.IsotonicRegression
-
check_increasing(*args, **kwargs)¶ Call
sklearn.isotonic.check_increasing using automatic mapping. x: ModelFrame.index, y: ModelFrame.target
-
isotonic_regression(*args, **kwargs)¶ Call
sklearn.isotonic.isotonic_regression using automatic mapping. y: ModelFrame.target
-
-
class
pandas_ml.skaccessors.linear_model.LinearModelMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.linear_model.
-
enet_path(*args, **kwargs)¶ Call
sklearn.linear_model.enet_path using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
lars_path(*args, **kwargs)¶ Call
sklearn.linear_model.lars_path using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
lasso_path(*args, **kwargs)¶ Call
sklearn.linear_model.lasso_path using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
lasso_stability_path(*args, **kwargs)¶ Call
sklearn.linear_model.lasso_stability_path using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
orthogonal_mp_gram(*args, **kwargs)¶ Call
sklearn.linear_model.orthogonal_mp_gram using automatic mapping. Gram: ModelFrame.data.T.dot(ModelFrame.data), Xy: ModelFrame.data.T.dot(ModelFrame.target)
-
-
class
pandas_ml.skaccessors.manifold.ManifoldMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.manifold.
-
locally_linear_embedding(n_neighbors, n_components, *args, **kwargs)¶ Call
sklearn.manifold.locally_linear_embedding using automatic mapping. X: ModelFrame.data
-
spectral_embedding(*args, **kwargs)¶ Call
sklearn.manifold.spectral_embedding using automatic mapping. adjacency: ModelFrame.data
-
-
class
pandas_ml.skaccessors.metrics.MetricsMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.metrics.
-
auc(kind='roc', reorder=False, **kwargs)¶ Calculate AUC of ROC curve or precision-recall curve
Parameters: - kind : {‘roc’, ‘precision_recall_curve’}
Returns: - float : AUC
-
average_precision_score(*args, **kwargs)¶ Call
sklearn.metrics.average_precision_score using automatic mapping. y_true: ModelFrame.target, y_score: ModelFrame.decision
-
confusion_matrix(*args, **kwargs)¶ Call
sklearn.metrics.confusion_matrix using automatic mapping. y_true: ModelFrame.target, y_pred: ModelFrame.predicted
-
consensus_score(*args, **kwargs)¶ Not implemented
-
f1_score(*args, **kwargs)¶ Call
sklearn.metrics.f1_score using automatic mapping. y_true: ModelFrame.target, y_pred: ModelFrame.predicted
-
fbeta_score(beta, *args, **kwargs)¶ Call
sklearn.metrics.fbeta_score using automatic mapping. y_true: ModelFrame.target, y_pred: ModelFrame.predicted
-
hinge_loss(*args, **kwargs)¶ Call
sklearn.metrics.hinge_loss using automatic mapping. y_true: ModelFrame.target, y_pred_decision: ModelFrame.decision
-
log_loss(*args, **kwargs)¶ Call
sklearn.metrics.log_loss using automatic mapping. y_true: ModelFrame.target, y_pred: ModelFrame.proba
-
pairwise¶ Not implemented
-
precision_recall_curve(*args, **kwargs)¶ Call
sklearn.metrics.precision_recall_curve using automatic mapping. y_true: ModelFrame.target, y_probas_pred: ModelFrame.decision
-
precision_recall_fscore_support(*args, **kwargs)¶ Call
sklearn.metrics.precision_recall_fscore_support using automatic mapping. y_true: ModelFrame.target, y_pred: ModelFrame.predicted
-
precision_score(*args, **kwargs)¶ Call
sklearn.metrics.precision_score using automatic mapping. y_true: ModelFrame.target, y_pred: ModelFrame.predicted
-
recall_score(*args, **kwargs)¶ Call
sklearn.metrics.recall_score using automatic mapping. y_true: ModelFrame.target, y_pred: ModelFrame.predicted
-
roc_auc_score(*args, **kwargs)¶ Call
sklearn.metrics.roc_auc_score using automatic mapping. y_true: ModelFrame.target, y_score: ModelFrame.decision
-
roc_curve(*args, **kwargs)¶ Call
sklearn.metrics.roc_curve using automatic mapping. y_true: ModelFrame.target, y_score: ModelFrame.decision
-
silhouette_samples(*args, **kwargs)¶ Call
sklearn.metrics.silhouette_samples using automatic mapping. X: ModelFrame.data, labels: ModelFrame.predicted
-
silhouette_score(*args, **kwargs)¶ Call
sklearn.metrics.silhouette_score using automatic mapping. X: ModelFrame.data, labels: ModelFrame.predicted
-
-
class
pandas_ml.skaccessors.model_selection.ModelSelectionMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.model_selection.
-
StratifiedShuffleSplit(*args, **kwargs)¶ Instantiate
sklearn.model_selection.StratifiedShuffleSplit using automatic mapping. y: ModelFrame.target
-
check_cv(cv, *args, **kwargs)¶ Call
sklearn.model_selection.check_cv using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
cross_val_score(estimator, *args, **kwargs)¶ Call
sklearn.model_selection.cross_val_score using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
describe(estimator)¶ Describe grid search results
Parameters: - estimator : fitted grid search estimator
Returns: - described : ModelFrame
-
iterate(cv, reset_index=False)¶ Deprecated. Use .split instead.
-
learning_curve(estimator, *args, **kwargs)¶ Call
sklearn.model_selection.learning_curve using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
permutation_test_score(estimator, *args, **kwargs)¶ Call
sklearn.model_selection.permutation_test_score using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
split(cv, reset_index=False)¶ Generate
ModelFrame using iterators for cross validation
Parameters: - cv : cross validation iterator
- reset_index : bool
logical value whether to reset index, default False
Returns: - generated : generator of ModelFrame
-
train_test_split(reset_index=False, *args, **kwargs)¶ Call
sklearn.model_selection.train_test_split using automatic mapping.
Parameters: - reset_index : bool
logical value whether to reset index, default False
- kwargs : keywords passed to
model_selection.train_test_split
Returns: - train, test : tuple of ModelFrame
-
validation_curve(estimator, param_name, param_range, *args, **kwargs)¶ Call
sklearn.model_selection.validation_curve using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
-
class
pandas_ml.skaccessors.neighbors.NeighborsMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.neighbors.
-
class
pandas_ml.skaccessors.pipeline.PipelineMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.pipeline.
-
make_pipeline¶ sklearn.pipeline.make_pipeline
-
make_union¶ sklearn.pipeline.make_union
-
-
class
pandas_ml.skaccessors.preprocessing.PreprocessingMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.preprocessing.
-
add_dummy_feature(value=1.0)¶ Call
sklearn.preprocessing.add_dummy_feature using automatic mapping. X: ModelFrame.data
-
-
class
pandas_ml.skaccessors.svm.SVMMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to sklearn.svm.
-
l1_min_c(*args, **kwargs)¶ Call
sklearn.svm.l1_min_c using automatic mapping. X: ModelFrame.data, y: ModelFrame.target
-
liblinear¶ Not implemented
-
libsvm¶ Not implemented
-
libsvm_sparse¶ Not implemented
-
Module contents¶
pandas_ml.xgboost package¶
Submodules¶
-
class
pandas_ml.xgboost.base.XGBoostMethods(df, module_name=None, attrs=None)¶ Bases:
pandas_ml.core.accessor._AccessorMethods
Accessor to xgboost.
-
XGBClassifier¶
-
XGBRegressor¶
-
plot_importance(ax=None, height=0.2, xlim=None, title='Feature importance', xlabel='F score', ylabel='Features', grid=True, **kwargs)¶ Plot importance based on fitted trees.
Parameters: - ax : matplotlib Axes, default None
Target axes instance. If None, new figure and axes will be created.
- height : float, default 0.2
Bar height, passed to ax.barh()
- xlim : tuple, default None
Tuple passed to axes.xlim()
- title : str, default “Feature importance”
Axes title. To disable, pass None.
- xlabel : str, default “F score”
X axis title label. To disable, pass None.
- ylabel : str, default “Features”
Y axis title label. To disable, pass None.
- kwargs :
Other keywords passed to ax.barh()
Returns: - ax : matplotlib Axes
-
plot_tree(num_trees=0, rankdir='UT', ax=None, **kwargs)¶ Plot specified tree.
Parameters: - booster : Booster, XGBModel
Booster or XGBModel instance
- num_trees : int, default 0
Specify the ordinal number of target tree
- rankdir : str, default “UT”
Passed to graphviz via graph_attr
- ax : matplotlib Axes, default None
Target axes instance. If None, new figure and axes will be created.
- kwargs :
Other keywords passed to to_graphviz
Returns: - ax : matplotlib Axes
-
to_graphviz(num_trees=0, rankdir='UT', yes_color='#0000FF', no_color='#FF0000', **kwargs)¶ Convert specified tree to a graphviz instance. IPython can automatically plot the returned graphviz instance. Otherwise, you should call the .render() method of the returned graphviz instance.
Parameters: - num_trees : int, default 0
Specify the ordinal number of target tree
- rankdir : str, default “UT”
Passed to graphviz via graph_attr
- yes_color : str, default ‘#0000FF’
Edge color when the node condition is met.
- no_color : str, default ‘#FF0000’
Edge color when the node condition is not met.
- kwargs :
Other keywords passed to graphviz graph_attr
Returns: - graph : graphviz instance
-