Welcome to dryscrape’s documentation!
dryscrape is a lightweight web scraping library for Python. It uses a headless Webkit instance to evaluate Javascript on the visited pages. This enables painless scraping of plain web pages as well as Javascript-heavy “Web 2.0” applications like Facebook.
It is built on the shoulders of capybara-webkit’s webkit-server. A big thanks goes to thoughtbot, inc. for building this excellent piece of software!
Installation
Prerequisites
Before installing dryscrape, you need to install some software it depends on:
On Ubuntu you can do that with one command (the # indicates that you need
root privileges for this):
# apt-get install qt5-default libqt5webkit5-dev build-essential \
python-lxml python-pip xvfb
Please note that Qt4 is also supported.
On Mac OS X, you can use Homebrew to install Qt and easy_install to install pip:
$ brew install qt
# easy_install pip
On other operating systems, you can use pip to install lxml (though you might have to install libxml and the Python headers first).
Recommended: Installing dryscrape from PyPI
This is as simple as a quick
# pip install dryscrape
Note that dryscrape supports Python 2.7 and 3 as of version 1.0.
Installing dryscrape from Git
First, get a copy of dryscrape using Git:
$ git clone https://github.com/niklasb/dryscrape.git dryscrape
$ cd dryscrape
To install dryscrape, you first need to install webkit-server. You can use pip to do this for you (while still in the dryscrape directory).
# pip install -r requirements.txt
If you want, you can of course also install the dependencies manually.
Afterwards, you can use the setup.py script included to install dryscrape:
# python setup.py install
Usage
First demonstration
A code sample says more than a thousand words:
import dryscrape
import sys
if 'linux' in sys.platform:
    # start xvfb in case no X is running. Make sure xvfb
    # is installed, otherwise this won't work!
    dryscrape.start_xvfb()
search_term = 'dryscrape'
# set up a web scraping session
sess = dryscrape.Session(base_url = 'http://google.com')
# we don't need images
sess.set_attribute('auto_load_images', False)
# visit homepage and search for a term
sess.visit('/')
q = sess.at_xpath('//*[@name="q"]')
q.set(search_term)
q.form().submit()
# extract all links
for link in sess.xpath('//a[@href]'):
    print(link['href'])
# save a screenshot of the web page
sess.render('google.png')
print("Screenshot written to 'google.png'")
In this sample, we use dryscrape to do a simple web search on Google.
Note that we do not pass a driver to the Session constructor here, so the
session creates a default Webkit driver instance itself. The session instance
then passes every method call it cannot resolve, such as visit() in this
case, to the underlying driver.
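The delegation described above can be sketched with Python's __getattr__ hook. This is only an illustration of the proxy pattern, not dryscrape's actual source; ProxySession and FakeDriver are hypothetical stand-ins so that the example runs without a Webkit server.

```python
class FakeDriver:
    """Hypothetical stand-in for a driver; it only records visited URLs."""

    def __init__(self):
        self.visited = []

    def visit(self, url):
        self.visited.append(url)


class ProxySession:
    """Hypothetical session that forwards unknown attributes to its driver."""

    def __init__(self, driver):
        self.driver = driver

    def __getattr__(self, name):
        # __getattr__ is only consulted when normal lookup fails, so
        # attributes defined on the session itself take precedence.
        return getattr(self.driver, name)


sess = ProxySession(FakeDriver())
sess.visit('/search')   # not defined on ProxySession, resolved on the driver
print(sess.visited)
```

Because the session defines no visit method of its own, the call falls through to the driver, which is the behaviour the paragraph above describes for Session.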
API Documentation
This documentation also contains the API docs for the webkit_server
module, for convenience (and because I am too lazy to set up dedicated docs
for it).
Overview
Module dryscrape.session
class dryscrape.session.Session(driver=None, base_url=None)
Bases: object
A web scraping session based on a driver instance. Implements the proxy pattern to pass unresolved method calls to the underlying driver.
If no driver is specified, the instance will create an instance of dryscrape.session.DefaultDriver to get a driver instance (defaults to dryscrape.driver.webkit.Driver).
If base_url is present, relative URLs are completed with this URL base. If not, the get_base_url method is called on itself to get the base URL.
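To illustrate what completing a relative URL against a base means, here is a small sketch using urljoin from the Python 3 standard library. The helper complete_url is hypothetical, and dryscrape's own implementation may resolve URLs differently.

```python
from urllib.parse import urljoin

BASE_URL = 'http://google.com'  # same base as in the first demonstration


def complete_url(url, base=BASE_URL):
    """Hypothetical helper: resolve url against base.

    Relative URLs are completed with the base, absolute URLs pass
    through unchanged.
    """
    return urljoin(base, url)


print(complete_url('/'))
print(complete_url('http://example.com/page'))
```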
Module dryscrape.mixins
Mixins for use in dryscrape drivers.
class dryscrape.mixins.AttributeMixin
Bases: object
Mixin that adds [] access syntax sugar to an object that supports a set_attr and get_attr method.
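The following sketch shows how such [] access can be layered on top of get_attr and set_attr. It is an assumption about the general technique, not a copy of dryscrape's code; BracketAccessMixin and FakeNode are hypothetical names.

```python
class BracketAccessMixin:
    """Hypothetical mixin mapping obj[name] onto get_attr/set_attr."""

    def __getitem__(self, attr):
        return self.get_attr(attr)

    def __setitem__(self, attr, value):
        self.set_attr(attr, value)


class FakeNode(BracketAccessMixin):
    """Hypothetical node that keeps its attributes in a plain dict."""

    def __init__(self, **attrs):
        self._attrs = dict(attrs)

    def get_attr(self, name):
        return self._attrs.get(name)

    def set_attr(self, name, value):
        self._attrs[name] = value


link = FakeNode(href='/about')
link['title'] = 'About us'          # routed through set_attr
print(link['href'], link['title'])  # routed through get_attr
```

This is the same access style used by link['href'] in the first demonstration.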
class dryscrape.mixins.HtmlParsingMixin
Bases: object
Mixin that adds a document method to an object that supports a body method returning valid HTML.
class dryscrape.mixins.SelectionMixin
Bases: object
Mixin that adds different methods of node selection to an object that provides an xpath method returning a collection of matches.
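As a rough sketch of how a selection helper can be built on nothing but an xpath method, the hypothetical mixin below adds a "first match or None" accessor. FirstMatchMixin and FakeDocument are invented for this example, and the lookup table stands in for a real XPath engine.

```python
class FirstMatchMixin:
    """Hypothetical mixin: derives a first-match helper from xpath()."""

    def at_xpath(self, expr):
        matches = self.xpath(expr)
        return matches[0] if matches else None


class FakeDocument(FirstMatchMixin):
    """Hypothetical document; a dict lookup replaces real XPath evaluation."""

    def __init__(self, results):
        self._results = results

    def xpath(self, expr):
        return self._results.get(expr, [])


doc = FakeDocument({'//a': ['first link', 'second link']})
print(doc.at_xpath('//a'))
print(doc.at_xpath('//p'))
```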
class dryscrape.mixins.WaitMixin
Bases: dryscrape.mixins.SelectionMixin
Mixin that allows waiting for conditions or elements.
at_css(css, timeout=1, **kw)
Returns the first node matching the given CSSv3 expression or None if a timeout occurs.
at_xpath(xpath, timeout=1, **kw)
Returns the first node matching the given XPath 2.0 expression or None if a timeout occurs.
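Waiting for an element with a timeout usually boils down to polling. The sketch below shows that idea in isolation; the wait_for helper, its interval parameter and the simulated find_node function are assumptions made for this example and are not taken from dryscrape's source.

```python
import time


def wait_for(condition, interval=0.05, timeout=1.0):
    """Hypothetical helper: poll condition() until it returns a truthy
    value, or return None once the timeout (in seconds) has elapsed."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Simulate a page on which the element only appears on the third lookup.
attempts = {'count': 0}


def find_node():
    attempts['count'] += 1
    return 'node' if attempts['count'] >= 3 else None


found = wait_for(find_node)
missing = wait_for(lambda: None, timeout=0.2)
print(found, missing)
```

Returning None on timeout mirrors the documented behaviour of at_css and at_xpath above.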
Module dryscrape.driver.webkit
Headless Webkit driver for dryscrape. Wraps the webkit_server module.
class dryscrape.driver.webkit.Driver(**kw)
Bases: webkit_server.Client, dryscrape.mixins.WaitMixin, dryscrape.mixins.HtmlParsingMixin
Driver implementation wrapping a webkit_server driver.
Keyword arguments are passed through to the underlying webkit_server.Client constructor. By default, node_factory_class is set to use the dryscrape node implementation.
class dryscrape.driver.webkit.Node(client, node_id)
Bases: webkit_server.Node, dryscrape.mixins.SelectionMixin, dryscrape.mixins.AttributeMixin
Node implementation wrapping a webkit_server node.
class dryscrape.driver.webkit.NodeFactory(client)
Bases: webkit_server.NodeFactory
Overrides the NodeFactory provided by webkit_server.
Module webkit_server
Python bindings for the webkit-server
class webkit_server.Client(connection=None, node_factory_class=<class 'webkit_server.NodeFactory'>)
Bases: webkit_server.SelectionMixin
Wrappers for the webkit_server commands.
If connection is not specified, a new instance of ServerConnection is created.
node_factory_class can be set to a value different from the default, in which case a new instance of the given class will be used to create nodes. The given class must accept a client instance through its constructor and support a create method that takes a node ID as an argument and returns a node object.
Deletes all cookies.
Returns a list of all cookies in cookie string format.
eval_script(expr)
Evaluates a piece of Javascript in the context of the current page and returns its value.
headers()
Returns a list of the last HTTP response headers. Header keys are normalized to capitalized form, as in User-Agent.
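The capitalized form mentioned for header keys can be illustrated as follows. The helper normalize_header_key is hypothetical and only demonstrates the naming convention; it is not claimed to be the code that webkit_server runs.

```python
def normalize_header_key(key):
    """Hypothetical helper: capitalize each dash-separated part of a
    header name, e.g. 'user-agent' becomes 'User-Agent'."""
    return '-'.join(part.capitalize() for part in key.split('-'))


print(normalize_header_key('user-agent'))
print(normalize_header_key('CONTENT-TYPE'))
```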
render(path, width=1024, height=1024)
Renders the current page to a PNG file (viewport size in pixels).
set_attribute(attr, value=True)
Sets a custom attribute for our Webkit instance. Possible attributes are:
- auto_load_images
- dns_prefetch_enabled
- plugins_enabled
- private_browsing_enabled
- javascript_can_open_windows
- javascript_can_access_clipboard
- offline_storage_database_enabled
- offline_web_application_cache_enabled
- local_storage_enabled
- local_storage_database_enabled
- local_content_can_access_remote_urls
- local_content_can_access_file_urls
- accelerated_compositing_enabled
- site_specific_quirks_enabled
For all those options, value must be a boolean. You can find more information about these options in the Qt docs.
Sets a cookie for future requests (must be in correct cookie string format).
set_error_tolerant(tolerant=True)
DEPRECATED! This function is a no-op now.
Used to set or unset the error tolerance flag in the server. If this flag was set, dropped requests or erroneous responses would not lead to an error.
set_html(html, url=None)
Sets custom HTML in our Webkit session and allows specifying a fake URL. Scripts and CSS are fetched dynamically as if the HTML had been loaded from the given URL.
exception webkit_server.EndOfStreamError(msg='Unexpected end of file')
Bases: exceptions.Exception
Raised when the Webkit server closed the connection unexpectedly.
exception webkit_server.InvalidResponseError
Bases: exceptions.Exception
Raised when the Webkit server signaled an error.
exception webkit_server.NoResponseError
Bases: exceptions.Exception
Raised when the Webkit server does not respond.
exception webkit_server.NoX11Error
Bases: webkit_server.WebkitServerError
Raised when the Webkit server cannot connect to X.
class webkit_server.Node(client, node_id)
Bases: webkit_server.SelectionMixin
Represents a DOM node in our Webkit session.
client is the associated client instance.
node_id is the internal ID that is used to identify the node when communicating with the server.
eval_script(js)
Evaluate arbitrary Javascript with the node variable bound to the current node.
exec_script(js)
Execute arbitrary Javascript with the node variable bound to the current node.
exception
webkit_server.NodeError[source]¶ Bases:
exceptions.ExceptionA problem occured within a
Nodeinstance method.
class webkit_server.NodeFactory(client)
Bases: object
Implements the default node factory.
client is the associated client instance.
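The contract stated for node_factory_class in the Client documentation (a constructor that accepts a client, plus a create method that takes a node ID and returns a node object) can be sketched like this. All three classes are hypothetical stand-ins so that the example runs without webkit_server installed.

```python
class FakeClient:
    """Hypothetical client that builds its nodes through a pluggable factory."""

    def __init__(self, node_factory_class):
        # A new factory instance is created and receives the client itself.
        self.factory = node_factory_class(self)

    def node(self, node_id):
        return self.factory.create(node_id)


class TaggedNode:
    """Hypothetical node type produced by the custom factory."""

    def __init__(self, client, node_id):
        self.client = client
        self.node_id = node_id


class TaggedNodeFactory:
    """Hypothetical factory fulfilling the documented contract."""

    def __init__(self, client):
        self.client = client

    def create(self, node_id):
        return TaggedNode(self.client, node_id)


client = FakeClient(node_factory_class=TaggedNodeFactory)
node = client.node('42')
print(type(node).__name__, node.node_id)
```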
class webkit_server.SelectionMixin
Bases: object
Implements a generic XPath selection for a class providing _get_xpath_ids, _get_css_ids and get_node_factory methods.
class webkit_server.Server(binary=None)
Bases: object
Manages a Webkit server process. If binary is given, the specified webkit_server binary is used instead of the included one.
class webkit_server.ServerConnection(server=None)
Bases: object
A connection to a Webkit server.
server is a server instance or None if a singleton server should be connected to (will be started if necessary).
class webkit_server.SocketBuffer(f)
Bases: object
A convenience class for buffered reads from a socket.
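To show what buffered reads from a socket typically involve, here is a self-contained sketch. The LineBuffer class, its method names and the tiny "status, length, payload" message are all invented for this example; they are not the SocketBuffer API or the actual webkit-server wire format.

```python
import socket


class LineBuffer:
    """Hypothetical buffered reader over a connected socket."""

    def __init__(self, sock):
        self.sock = sock
        self.buf = b''

    def _fill(self):
        # Pull more bytes from the socket into the internal buffer.
        chunk = self.sock.recv(4096)
        if not chunk:
            raise EOFError('Unexpected end of file')
        self.buf += chunk

    def read_line(self):
        """Return the next line without its trailing newline."""
        while b'\n' not in self.buf:
            self._fill()
        line, self.buf = self.buf.split(b'\n', 1)
        return line

    def read(self, n):
        """Return exactly n bytes."""
        while len(self.buf) < n:
            self._fill()
        data, self.buf = self.buf[:n], self.buf[n:]
        return data


# A local socket pair stands in for the connection to a server.
writer, reader_sock = socket.socketpair()
writer.sendall(b'ok\n5\nhello')
writer.close()

reader = LineBuffer(reader_sock)
status = reader.read_line()
length = int(reader.read_line())
payload = reader.read(length)
reader_sock.close()
print(status, length, payload)
```

Buffering lets the caller mix line-oriented and length-prefixed reads without losing bytes that arrived in the same recv call.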