Dask-glm¶
Dask-glm is a library for fitting Generalized Linear Models on large datasets. It builds on the dask project to fit GLMs in parallel and offers a scikit-learn compatible API for specifying your model.
Estimators¶
The estimators module offers a scikit-learn compatible API for specifying your model and hyper-parameters, and for fitting your model to data.
>>> from dask_glm.estimators import LogisticRegression
>>> from dask_glm.datasets import make_classification
>>> X, y = make_classification()
>>> lr = LogisticRegression()
>>> lr.fit(X, y)
>>> lr
LogisticRegression(abstol=0.0001, fit_intercept=True, lamduh=1.0,
max_iter=100, over_relax=1, regularizer='l2', reltol=0.01, rho=1,
solver='admm', tol=0.0001)
All of the estimators follow a similar API. They can be instantiated with a set of parameters that control the fit, including whether to add an intercept, which solver to use, how to regularize the inputs, and various optimization parameters.
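For instance, a minimal sketch of instantiating a model with non-default choices (the option names come from the API Reference below):
>>> from dask_glm.estimators import LogisticRegression
>>> lr = LogisticRegression(solver='lbfgs', regularizer='l1', max_iter=50)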
Given an instantiated estimator, you pass the data to the .fit method. It takes an X, the feature matrix or exogenous data, and a y, the target or endogenous data. Each of these can be a NumPy or dask array.
With a fit model, you can make new predictions using the .predict method, and can score known observations with the .score method.
>>> lr.predict(X).compute()
array([False, False, False, True, ... True, False, True, True], dtype=bool)
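Scoring follows the same pattern; a quick sketch, reusing the X and y from above (the result is lazy until computed):
>>> lr.score(X, y).compute()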
See the API Reference for more.
Examples¶
A collection of notebooks demonstrating dask_glm
.
Scikit-Learn-style API¶
This example demonstrates compatibility with scikit-learn's basic fit
API. For demonstration, we'll use the perennial NYC taxi cab dataset.
In [1]:
import os
import s3fs
import pandas as pd
import dask.array as da
import dask.dataframe as dd
from distributed import Client
from dask import persist
from dask_glm.estimators import LogisticRegression
In [2]:
if not os.path.exists('trip.csv'):
    s3 = s3fs.S3FileSystem(anon=True)
    s3.get("dask-data/nyc-taxi/2015/yellow_tripdata_2015-01.csv", "trip.csv")
In [3]:
client = Client()
In [4]:
ddf = dd.read_csv("trip.csv")
We can use the dask.dataframe
API to explore the dataset, and notice
that some of the values look suspicious:
In [5]:
ddf[['trip_distance', 'fare_amount']].describe().compute()
Out[5]:
|       | trip_distance | fare_amount   |
|-------|---------------|---------------|
| count | 1.274899e+07  | 1.274899e+07  |
| mean  | 1.345913e+01  | 1.190566e+01  |
| std   | 9.844094e+03  | 1.030254e+01  |
| min   | 0.000000e+00  | -4.500000e+02 |
| 25%   | 1.000000e+00  | 6.500000e+00  |
| 50%   | 1.700000e+00  | 9.000000e+00  |
| 75%   | 3.100000e+00  | 1.350000e+01  |
| max   | 1.542000e+07  | 4.008000e+03  |
Scikit-learn doesn't yet support filtering observations inside a pipeline, so we'll do this before anything else.
In [6]:
# these filter out less than 1% of the observations
ddf = ddf[(ddf.trip_distance < 20) &
          (ddf.fare_amount < 150)]
Now, we’ll split our DataFrame into a train and test set, and select our feature matrix and target column (whether the passenger tipped).
In [7]:
df_train, df_test = ddf.random_split([0.80, 0.20], random_state=2)
columns = ['VendorID', 'passenger_count', 'trip_distance', 'payment_type', 'fare_amount']
X_train, y_train = df_train[columns], df_train['tip_amount'] > 0
X_test, y_test = df_test[columns], df_test['tip_amount'] > 0
X_train, y_train, X_test, y_test = persist(
    X_train, y_train, X_test, y_test
)
With our training data in hand, we fit our logistic regression. Nothing
here should be surprising to those familiar with scikit-learn
.
In [8]:
%%time
# this is a *dask-glm* LogisticRegression, not scikit-learn
lm = LogisticRegression(fit_intercept=False)
lm.fit(X_train.values, y_train.values)
CPU times: user 35.9 s, sys: 8.69 s, total: 44.6 s
Wall time: 9min 2s
Again, following the lead of scikit-learn we can measure the performance of the estimator on the training dataset:
In [9]:
lm.score(X_train.values, y_train.values).compute()
Out[9]:
0.90022477759757635
and on the test dataset:
In [10]:
lm.score(X_test.values, y_test.values).compute()
Out[10]:
0.90030262922441306
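The fitted model can also produce predictions for the held-out set; a quick sketch (the result is a lazy dask array):
preds = lm.predict(X_test.values)  # lazy dask array of booleans: did the passenger tip?
preds[:10].compute()               # materialize a small sample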
API Reference¶
Estimators¶
Models following scikit-learn’s estimator API.
class dask_glm.estimators.LinearRegression(fit_intercept=True, solver='admm', regularizer='l2', max_iter=100, tol=0.0001, lamduh=1.0, rho=1, over_relax=1, abstol=0.0001, reltol=0.01)¶
Estimator for a linear model using Ordinary Least Squares.

Parameters:
fit_intercept : bool, default True
    Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
solver : {'admm', 'gradient_descent', 'newton', 'lbfgs', 'proximal_grad'}
    Solver to use. See Algorithms for details.
regularizer : {'l1', 'l2'}
    Regularizer to use. See Regularizers for details. Only used with the admm and proximal_grad solvers.
max_iter : int, default 100
    Maximum number of iterations taken for the solvers to converge.
tol : float, default 1e-4
    Tolerance for stopping criteria. Ignored for the admm solver.
lamduh : float, default 1.0
    Only used with the admm and proximal_grad solvers.
rho, over_relax, abstol, reltol : float
    Only used with the admm solver.

Examples

>>> from dask_glm.datasets import make_regression
>>> X, y = make_regression()
>>> est = LinearRegression()
>>> est.fit(X, y)
>>> est.predict(X)
>>> est.score(X, y)

Attributes
coef_ : array, shape (n_classes, n_features)
    The learned value for the model's coefficients.
intercept_ : float or None
    The learned value for the intercept, if one was added to the model.
class dask_glm.estimators.LogisticRegression(fit_intercept=True, solver='admm', regularizer='l2', max_iter=100, tol=0.0001, lamduh=1.0, rho=1, over_relax=1, abstol=0.0001, reltol=0.01)¶
Estimator for logistic regression.

Parameters:
fit_intercept : bool, default True
    Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
solver : {'admm', 'gradient_descent', 'newton', 'lbfgs', 'proximal_grad'}
    Solver to use. See Algorithms for details.
regularizer : {'l1', 'l2'}
    Regularizer to use. See Regularizers for details. Only used with the admm, lbfgs, and proximal_grad solvers.
max_iter : int, default 100
    Maximum number of iterations taken for the solvers to converge.
tol : float, default 1e-4
    Tolerance for stopping criteria. Ignored for the admm solver.
lamduh : float, default 1.0
    Only used with the admm, lbfgs, and proximal_grad solvers.
rho, over_relax, abstol, reltol : float
    Only used with the admm solver.

Examples

>>> from dask_glm.datasets import make_classification
>>> X, y = make_classification()
>>> lr = LogisticRegression()
>>> lr.fit(X, y)
>>> lr.predict(X)
>>> lr.predict_proba(X)
>>> lr.score(X, y)

Attributes
coef_ : array, shape (n_classes, n_features)
    The learned value for the model's coefficients.
intercept_ : float or None
    The learned value for the intercept, if one was added to the model.
class dask_glm.estimators.PoissonRegression(fit_intercept=True, solver='admm', regularizer='l2', max_iter=100, tol=0.0001, lamduh=1.0, rho=1, over_relax=1, abstol=0.0001, reltol=0.01)¶
Estimator for Poisson regression.

Parameters:
fit_intercept : bool, default True
    Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
solver : {'admm', 'gradient_descent', 'newton', 'lbfgs', 'proximal_grad'}
    Solver to use. See Algorithms for details.
regularizer : {'l1', 'l2'}
    Regularizer to use. See Regularizers for details. Only used with the admm, lbfgs, and proximal_grad solvers.
max_iter : int, default 100
    Maximum number of iterations taken for the solvers to converge.
tol : float, default 1e-4
    Tolerance for stopping criteria. Ignored for the admm solver.
lamduh : float, default 1.0
    Only used with the admm, lbfgs, and proximal_grad solvers.
rho, over_relax, abstol, reltol : float
    Only used with the admm solver.

Examples

>>> from dask_glm.datasets import make_poisson
>>> X, y = make_poisson()
>>> pr = PoissonRegression()
>>> pr.fit(X, y)
>>> pr.predict(X)
>>> pr.get_deviance(X, y)

Attributes
coef_ : array, shape (n_classes, n_features)
    The learned value for the model's coefficients.
intercept_ : float or None
    The learned value for the intercept, if one was added to the model.
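A short sketch of reading the learned attributes off any of these estimators once it has been fit:
>>> from dask_glm.datasets import make_classification
>>> from dask_glm.estimators import LogisticRegression
>>> X, y = make_classification()
>>> lr = LogisticRegression()
>>> lr.fit(X, y)
>>> lr.coef_       # learned coefficients
>>> lr.intercept_  # learned intercept, or None if none was added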
Families¶
class dask_glm.families.Logistic¶
Implements methods for logistic regression, useful for classifying binary outcomes.

class dask_glm.families.Normal¶
Implements methods for linear regression, useful for modeling continuous outcomes.

class dask_glm.families.Poisson¶
Implements Poisson regression, useful for modelling count data.
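A family is passed to the solver functions below through their family keyword; the estimators above select the corresponding family for you. A minimal sketch (assuming the admm defaults documented below):
>>> from dask_glm.algorithms import admm
>>> from dask_glm.families import Poisson
>>> from dask_glm.datasets import make_poisson
>>> X, y = make_poisson()
>>> beta = admm(X, y, family=Poisson)  # coefficient estimates for a Poisson model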
Algorithms¶
Optimization algorithms for solving minimization problems.
dask_glm.algorithms.admm(X, y, regularizer='l1', lamduh=0.1, rho=1, over_relax=1, max_iter=250, abstol=0.0001, reltol=0.01, family=<class 'dask_glm.families.Logistic'>, **kwargs)¶
Alternating Direction Method of Multipliers.

Parameters:
X : array-like, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
regularizer : str or Regularizer
lamduh : float
rho : float
over_relax : float
max_iter : int
    Maximum number of iterations to attempt before declaring failure to converge.
abstol, reltol : float
family : Family

Returns:
beta : array-like, shape (n_features,)
dask_glm.algorithms.compute_stepsize_dask(beta, step, Xbeta, Xstep, y, curr_val, family=<class 'dask_glm.families.Logistic'>, stepSize=1.0, armijoMult=0.1, backtrackMult=0.1)¶
Compute the optimal stepsize.

Parameters:
beta : array-like
step : float
Xbeta : array-like
Xstep : array-like
y : array-like
curr_val : float
family : Family, optional
stepSize : float, optional
armijoMult : float, optional
backtrackMult : float, optional

Returns:
stepSize : float
beta : array-like
Xbeta : array-like
func : callable
dask_glm.algorithms.gradient_descent(X, y, max_iter=100, tol=1e-14, family=<class 'dask_glm.families.Logistic'>, **kwargs)¶
Michael Grant's implementation of Gradient Descent.

Parameters:
X : array-like, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
max_iter : int
    Maximum number of iterations to attempt before declaring failure to converge.
tol : float
    Maximum allowed change from prior iteration required to declare convergence.
family : Family

Returns:
beta : array-like, shape (n_features,)
dask_glm.algorithms.lbfgs(X, y, regularizer=None, lamduh=1.0, max_iter=100, tol=0.0001, family=<class 'dask_glm.families.Logistic'>, verbose=False, **kwargs)¶
L-BFGS solver using the scipy.optimize implementation.

Parameters:
X : array-like, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
max_iter : int
    Maximum number of iterations to attempt before declaring failure to converge.
tol : float
    Maximum allowed change from prior iteration required to declare convergence.
family : Family

Returns:
beta : array-like, shape (n_features,)
dask_glm.algorithms.newton(X, y, max_iter=50, tol=1e-08, family=<class 'dask_glm.families.Logistic'>, **kwargs)¶
Newton's Method for Logistic Regression.

Parameters:
X : array-like, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
max_iter : int
    Maximum number of iterations to attempt before declaring failure to converge.
tol : float
    Maximum allowed change from prior iteration required to declare convergence.
family : Family

Returns:
beta : array-like, shape (n_features,)
dask_glm.algorithms.proximal_grad(X, y, regularizer='l1', lamduh=0.1, family=<class 'dask_glm.families.Logistic'>, max_iter=100, tol=1e-08, **kwargs)¶

Parameters:
X : array-like, shape (n_samples, n_features)
y : array-like, shape (n_samples,)
max_iter : int
    Maximum number of iterations to attempt before declaring failure to converge.
tol : float
    Maximum allowed change from prior iteration required to declare convergence.
family : Family
verbose : bool, default False
    Whether to print diagnostic information during convergence.

Returns:
beta : array-like, shape (n_features,)
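The solvers can also be called directly when only the coefficient vector is needed. A minimal sketch using the defaults documented above:
>>> from dask_glm.algorithms import admm, lbfgs
>>> from dask_glm.datasets import make_classification
>>> X, y = make_classification()
>>> beta_admm = admm(X, y, regularizer='l2', lamduh=1.0)  # ADMM with an L2 penalty
>>> beta_lbfgs = lbfgs(X, y)                              # unregularized L-BFGS fit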
Regularizers¶
Available Regularizers¶
These regularizers are included with dask-glm.
Regularizer Interface¶
Users wishing to implement their own regularizer should satisfy this interface (a sketch of a custom regularizer follows the reference below).
class dask_glm.regularizers.Regularizer¶
Abstract base class for regularization objects.
Defines the set of methods required to create a new regularization object. This includes the regularization function itself and its gradient, hessian, and proximal operator.
add_reg_f(f, lam)¶
Add the regularization function to another function.

Parameters:
f : callable
    Function taking beta and *args.
lam : float
    Regularization constant.

Returns:
wrapped : callable
    Function taking beta and *args.
add_reg_grad(grad, lam)¶
Add the regularization gradient to another gradient function.

Parameters:
grad : callable
    Function taking beta and *args.
lam : float
    Regularization constant.

Returns:
wrapped : callable
    Function taking beta and *args.
add_reg_hessian(hess, lam)¶
Add the regularization hessian to another hessian function.

Parameters:
hess : callable
    Function taking beta and *args.
lam : float
    Regularization constant.

Returns:
wrapped : callable
    Function taking beta and *args.
f(beta)¶
Regularization function.

Parameters:
beta : array, shape (n_features,)

Returns:
result : float
classmethod get(obj)¶
Get the concrete instance for the name obj.

Parameters:
obj : Regularizer or str
    Valid instances of Regularizer are passed through. Strings are looked up according to obj.name and a new instance is created.

Returns:
obj : Regularizer
gradient(beta)¶
Gradient of the regularization function.

Parameters:
beta : array, shape (n_features,)

Returns:
gradient : array, shape (n_features,)
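As promised above, here is a minimal sketch of a custom regularizer (a hypothetical squared-L2 penalty, not part of dask-glm); a complete implementation would also supply the hessian and proximal operator mentioned above. Built-in regularizers can be retrieved by their name string via Regularizer.get:

from dask_glm.regularizers import Regularizer


class SquaredL2(Regularizer):
    """Hypothetical penalty: 0.5 * sum(beta ** 2), the same form as the built-in 'l2'."""

    name = 'squared_l2'  # Regularizer.get looks subclasses up by this name attribute

    def f(self, beta):
        # regularization function evaluated at beta
        return (beta ** 2).sum() / 2

    def gradient(self, beta):
        # gradient of 0.5 * sum(beta ** 2) is simply beta
        return beta


# Built-in regularizers register under their names, e.g. 'l1' and 'l2'
l2 = Regularizer.get('l2')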