Welcome to Wpull’s documentation!

Homepage: https://github.com/chfoo/wpull

Contents:

Introduction

Wpull is a Wget-compatible web downloader and crawler (a remake, clone, replacement, or alternative).

[Logo: a dog pulling a box via a harness.]

Notable Features:

  • Written in Python: lightweight, modifiable, robust, & scriptable
  • Graceful stopping; on-disk database resume
  • PhantomJS & youtube-dl integration (experimental)

Wpull is designed to be (almost) a drop-in replacement for Wget with minimal changes to options. It is designed for running much larger crawls rather than speedily downloading a single file.

Wpull’s behavior is not an exact duplicate of Wget’s behavior. As such, you should not expect identical output and operation from Wpull. However, it aims to be a very useful alternative, as its source code can be easily modified to fix, change, or extend its behaviors.

For instructions, read on to the next sections. Confused? Check out the Frequently Asked Questions.

Installation

Requirements

Wpull requires the following:

  • Python 3.4 or greater
  • lxml for parsing HTML and XML documents

The following are optional:

  • psutil for monitoring disk space
  • Manhole for a REPL debugging socket
  • PhantomJS 1.9.8 or 2.1 for capturing interactive JavaScript pages
  • youtube-dl for downloading complex video streaming sites

To install Wpull, it is recommended to use the pip installer.

Wpull is officially supported in a Unix-like environment.

Automatic Install

Once you have installed Python, lxml, and pip, install Wpull with dependencies automatically from PyPI:

pip3 install wpull

Tip

Adding the --upgrade option will upgrade Wpull to the latest release. Use --no-deps to upgrade only Wpull.

Adding the --user option will install Wpull into your home directory.
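
For example, to upgrade an existing per-user installation:

pip3 install --upgrade --user wpull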

Automatic install is usually the best option. However, there may be outstanding fixes to bugs that are not yet released to PyPI. In this case, use the manual install.

Manual Install

Install the dependencies known to work with Wpull:

pip3 install -r https://raw.githubusercontent.com/chfoo/wpull/master/requirements.txt

Install Wpull from GitHub:

pip3 install git+https://github.com/chfoo/wpull.git#egg=wpull

Tip

Using git+https://github.com/chfoo/wpull.git@develop#egg=wpull as the path will install Wpull’s develop branch.

psutil

psutil is required for the disk and memory monitoring options but may not be available on all platforms. To install:

pip3 install psutil

Pre-built Binaries

Wpull has pre-built binaries located at https://launchpad.net/wpull/+download. These are unsupported and may not be up to date.

Caveats

Python

Please obtain the latest Python release from http://python.org/download/ or your package manager. It is recommended to use Python 3.4.3 or greater. Versions 3.4 and 3.5 are officially supported.

Python 2 and PyPy are not supported.

lxml

It is recommended that lxml be obtained through an installer or pre-built package. Windows packages are provided on https://pypi.python.org/pypi/lxml. Debian/Ubuntu users should install python3-lxml. For more information, see http://lxml.de/installation.html.
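
For example, on Debian/Ubuntu:

sudo apt-get install python3-lxml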

pip

If pip is not yet installed on your system, please follow the instructions at http://www.pip-installer.org/en/latest/installing.html to install pip. Note for Linux users: ensure you are executing the appropriate Python version when installing pip.

PhantomJS (Optional)

It is recommended to download a prebuilt binary build from http://phantomjs.org/download.html.

Usage

Intro

Wpull is a command-line-oriented program much like Wget. It is non-interactive and requires all options to be specified at start up. If you are not familiar with Wget, please see the Wikipedia article on Wget.

Example Commands

To download the About page of Google.com:

wpull google.com/about

To archive a website:

wpull billy.blogsite.example \
    --warc-file blogsite-billy \
    --no-check-certificate \
    --no-robots --user-agent "InconspicuousWebBrowser/1.0" \
    --wait 0.5 --random-wait --waitretry 600 \
    --page-requisites --recursive --level inf \
    --span-hosts-allow linked-pages,page-requisites \
    --escaped-fragment --strip-session-id \
    --sitemaps \
    --reject-regex "/login\.php" \
    --tries 3 --retry-connrefused --retry-dns-error \
    --timeout 60 --session-timeout 21600 \
    --delete-after --database blogsite-billy.db \
    --quiet --output-file blogsite-billy.log

Wpull can also be invoked using:

python3 -m wpull

Stopping & Resuming

To gracefully stop Wpull, press CTRL+C (or send SIGINT). Wpull will quit once the current download has finished. To stop immediately, press CTRL+C again (or send SIGTERM).

If you have used the --database option, Wpull can reuse the existing database for resuming crawls. This behavior is different from --continue. Resuming with --continue is intended for resuming partially downloaded files while --database is intended for resuming partial crawls.

To resume a crawl provided you have used --database, simply reuse the same command options from the previous run. This will maintain the same behavior as the previous run. You may also tweak the options, for example, limit the recursion depth.
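
For example, a crawl started with the command below (example.com and mycrawl.db are placeholder values) can later be resumed by running the exact same command again:

wpull example.com --recursive --database mycrawl.db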

Note

When resuming downloads with --warc-file and --database, Wpull will overwrite the WARC file by default. This occurs because Wpull simply maintains a list of URLs that are fetched and not fetched. You should either rename the existing file manually, use --warc-append to append to it, or use --warc-move to move completed files out of the way.
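
For example, to resume the crawl above while appending to its existing WARC file instead of overwriting it:

wpull example.com --recursive --database mycrawl.db \
    --warc-file mycrawl --warc-append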

Proxied Services

Wpull is able to use an HTTP proxy server to capture traffic from third-party programs such as PhantomJS. The requests will go through the proxy to Wpull’s HTTP client (which can be recorded with --warc-file).

Warning

Wpull uses the HTTP proxy insecurely on localhost.

It is possible for another user, on the same machine as Wpull, to send bogus requests to the HTTP proxy. Wpull, however, does not expose the HTTP proxy to the network by default.

It is not possible to use the proxy standalone at this time.

PhantomJS Integration

PhantomJS support is currently experimental.

--phantomjs will enable PhantomJS integration.

If an HTML document is encountered, Wpull will open the URL in PhantomJS. After the page is loaded, Wpull will try to scroll the page as specified by --phantomjs-scroll. Then, the HTML DOM source is scraped for URLs as normal. HTML and PDF snapshots are taken by default.
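
For example, a hypothetical crawl that loads each HTML page in PhantomJS and scrolls it up to 10 times (example.com is a placeholder):

wpull example.com --recursive --phantomjs \
    --phantomjs-scroll 10 --phantomjs-wait 2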

Currently, Wpull will not do anything else to manipulate the page such as clicking on links. As a consequence, Wpull with PhantomJS is not a complete solution for dynamic web pages yet!

Storing console logs and alert messages inside the WARC file is not yet supported.

youtube-dl Integration

youtube-dl support is currently experimental.

--youtube-dl will enable youtube-dl integration.

If an HTML document is encountered, Wpull will run youtube-dl on the URL. Wpull passes the options for downloading subtitles and thumbnails. Other options are left at their defaults, which may not grab the best possible quality. For example, youtube-dl may not grab the highest quality stream if it is not a simple video file.
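
For example, a hypothetical single-page fetch with youtube-dl enabled (example.com is a placeholder):

wpull example.com/videos --youtube-dl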

Using recursion is not recommended because it may fetch large amounts of redundant data.

Storing manifests, metadata, or converted files inside the WARC file is not yet supported.

Options

Wget-compatible web downloader and crawler.

usage: wpull [-h] [-V] [--plugin-script FILE] [--plugin-args PLUGIN_ARGS]
             [--database FILE | --database-uri URI] [--concurrent N]
             [--debug-console-port PORT] [--debug-manhole]
             [--ignore-fatal-errors] [--monitor-disk MONITOR_DISK]
             [--monitor-memory MONITOR_MEMORY] [-o FILE | -a FILE]
             [-d | -v | -nv | -q | -qq] [--ascii-print]
             [--report-speed TYPE={bits}] [-i FILE] [-F] [-B URL]
             [--http-proxy HTTP_PROXY] [--https-proxy HTTPS_PROXY]
             [--proxy-user USER] [--proxy-password PASS] [--no-proxy]
             [--proxy-domains LIST] [--proxy-exclude-domains LIST]
             [--proxy-hostnames LIST] [--proxy-exclude-hostnames LIST]
             [-t NUMBER] [--retry-connrefused] [--retry-dns-error] [-O FILE]
             [-nc] [-c] [--progress TYPE={bar,dot,none}] [-N]
             [--no-use-server-timestamps] [-S] [-T SECONDS]
             [--dns-timeout SECS] [--connect-timeout SECS]
             [--read-timeout SECS] [--session-timeout SECS] [-w SECONDS]
             [--waitretry SECONDS] [--random-wait] [-Q NUMBER]
             [--bind-address ADDRESS] [--limit-rate RATE] [--no-dns-cache]
             [--rotate-dns] [--no-skip-getaddrinfo]
             [--restrict-file-names MODES=<ascii,lower,nocontrol,unix,upper,windows>]
             [-4 | -6 | --prefer-family FAMILY={IPv4,IPv6,none}] [--user USER]
             [--password PASSWORD] [--no-iri] [--local-encoding ENC]
             [--remote-encoding ENC] [--max-filename-length NUMBER] [-nd | -x]
             [-nH] [--protocol-directories] [-P PREFIX] [--cut-dirs NUMBER]
             [--http-user HTTP_USER] [--http-password HTTP_PASSWORD]
             [--no-cache] [--default-page NAME] [-E] [--ignore-length]
             [--header STRING] [--max-redirect NUMBER] [--referer URL]
             [--save-headers] [-U AGENT] [--no-robots] [--no-http-keep-alive]
             [--no-cookies] [--load-cookies FILE] [--save-cookies FILE]
             [--keep-session-cookies] [--post-data STRING | --post-file FILE]
             [--content-disposition] [--content-on-error] [--http-compression]
             [--html-parser {html5lib,libxml2-lxml}]
             [--link-extractors <css,html,javascript>] [--escaped-fragment]
             [--strip-session-id]
             [--secure-protocol PR={SSLv3,TLSv1,TLSv1.1,TLSv1.2,auto}]
             [--https-only] [--no-check-certificate] [--no-strong-crypto]
             [--certificate FILE] [--certificate-type TYPE={PEM}]
             [--private-key FILE] [--private-key-type TYPE={PEM}]
             [--ca-certificate FILE] [--ca-directory DIR]
             [--no-use-internal-ca-certs] [--random-file FILE]
             [--edg-file FILE] [--ftp-user USER] [--ftp-password PASS]
             [--no-remove-listing] [--no-glob] [--preserve-permissions]
             [--retr-symlinks [{0,1,no,off,on,yes}]] [--warc-file FILENAME]
             [--warc-append] [--warc-header STRING] [--warc-max-size NUMBER]
             [--warc-move DIRECTORY] [--warc-cdx] [--warc-dedup FILE]
             [--no-warc-compression] [--no-warc-digests] [--no-warc-keep-log]
             [--warc-tempdir DIRECTORY] [-r] [-l NUMBER] [--delete-after] [-k]
             [-K] [-p] [--page-requisites-level NUMBER] [--sitemaps] [-A LIST]
             [-R LIST] [--accept-regex REGEX] [--reject-regex REGEX]
             [--regex-type TYPE={pcre}] [-D LIST] [--exclude-domains LIST]
             [--hostnames LIST] [--exclude-hostnames LIST] [--follow-ftp]
             [--follow-tags LIST] [--ignore-tags LIST]
             [-H | --span-hosts-allow LIST=<linked-pages,page-requisites>]
             [-L] [-I LIST] [--trust-server-names] [-X LIST] [-np]
             [--no-strong-redirects] [--proxy-server]
             [--proxy-server-address ADDRESS] [--proxy-server-port PORT]
             [--phantomjs] [--phantomjs-exe PATH]
             [--phantomjs-max-time PHANTOMJS_MAX_TIME]
             [--phantomjs-scroll NUM] [--phantomjs-wait SEC]
             [--no-phantomjs-snapshot] [--no-phantomjs-smart-scroll]
             [--youtube-dl] [--youtube-dl-exe PATH]
             [URL [URL ...]]
Positional arguments:
urls the URL to be downloaded
Options:
-V, --version show program’s version number and exit
--plugin-script
 load plugin script from FILE
--plugin-args arguments for the plugin
--database save database tables into FILE instead of memory
--database-uri save database tables at SQLAlchemy URI instead of memory
--concurrent run at most N downloads at the same time
--debug-console-port
 run a web debug console at given port number
--debug-manhole
 install Manhole debugging socket
--ignore-fatal-errors
 ignore all internal fatal exception errors
--monitor-disk pause if minimum free disk space is exceeded
--monitor-memory
 pause if minimum free memory is exceeded
-o, --output-file
 write program messages to FILE
-a, --append-output
 append program messages to FILE
-d, --debug print debugging messages
-v, --verbose print informative program messages and detailed progress
-nv, --no-verbose
 print informative program messages and errors
-q, --quiet print program error messages
-qq, --very-quiet
 do not print program messages unless critical
--ascii-print print program messages in ASCII only
--report-speed

print speed in bits only instead of human formatted units

Possible choices: bits

-i, --input-file
 download URLs listed in FILE
-F, --force-html
 read URL input files as HTML files
-B, --base resolves input relative URLs to URL
--http-proxy HTTP proxy for HTTP requests
--https-proxy HTTP proxy for HTTPS requests
--proxy-user username for proxy “basic” authentication
--proxy-password
 password for proxy “basic” authentication
--no-proxy disable proxy support
--proxy-domains
 use proxy only from LIST of hostname suffixes
--proxy-exclude-domains
 don’t use proxy only from LIST of hostname suffixes
--proxy-hostnames
 use proxy only from LIST of hostnames
--proxy-exclude-hostnames
 don’t use proxy only from LIST of hostnames
-t, --tries try NUMBER of times on transient errors
--retry-connrefused
 retry even if the server does not accept connections
--retry-dns-error
 retry even if DNS fails to resolve hostname
-O, --output-document
 stream every document into FILE
-nc, --no-clobber
 don’t use anti-clobbering filenames
-c, --continue resume downloading a partially-downloaded file
--progress

choose the type of progress indicator

Possible choices: dot, bar, none

-N, --timestamping
 only download files that are newer than local files
--no-use-server-timestamps
 don’t set the last-modified time on files
-S, --server-response
 print the protocol responses from the server
-T, --timeout set DNS, connect, read timeout options to SECONDS
--dns-timeout timeout after SECS seconds for DNS requests
--connect-timeout
 timeout after SECS seconds for connection requests
--read-timeout timeout after SECS seconds for reading requests
--session-timeout
 timeout after SECS seconds for downloading files
-w, --wait wait SECONDS seconds between requests
--waitretry wait up to SECONDS seconds on retries
--random-wait randomly perturb the time between requests
-Q, --quota stop after downloading NUMBER bytes
--bind-address bind to ADDRESS on the local host
--limit-rate limit download bandwidth to RATE
--no-dns-cache disable caching of DNS lookups
--rotate-dns use different resolved IP addresses on requests
--no-skip-getaddrinfo
 always use the OS’s name resolver interface
--restrict-file-names

list of safe filename modes to use

Possible choices: unix, nocontrol, ascii, windows, lower, upper

-4, --inet4-only
 connect to IPv4 addresses only
-6, --inet6-only
 connect to IPv6 addresses only
--prefer-family

prefer to connect to FAMILY IP addresses

Possible choices: none, IPv6, IPv4

--user username for both FTP and HTTP authentication
--password password for both FTP and HTTP authentication
--no-iri use ASCII encoding only
--local-encoding
 use ENC as the encoding of input files and options
--remote-encoding
 force decoding documents using codec ENC
--max-filename-length
 limit filename length to NUMBER characters
-nd, --no-directories
 don’t create directories
-x, --force-directories
 always create directories
-nH, --no-host-directories
 don’t create directories for hostnames
--protocol-directories
 create directories for URL schemes
-P, --directory-prefix
 save everything under the directory PREFIX
--cut-dirs don’t make NUMBER of leading directories
--http-user username for HTTP authentication
--http-password
 password for HTTP authentication
--no-cache request server to not use cached version of files
--default-page use NAME as index page if not known
-E, --adjust-extension
 append HTML or CSS file extension if needed
--ignore-length
 ignore any Content-Length provided by the server
--header adds STRING to the HTTP header
--max-redirect follow only up to NUMBER document redirects
--referer always use URL as the referrer
--save-headers include server header responses in files
-U, --user-agent
 use AGENT instead of Wpull’s user agent
--no-robots ignore robots.txt directives
--no-http-keep-alive
 disable persistent HTTP connections
--no-cookies disables HTTP cookie support
--load-cookies load Mozilla cookies.txt from FILE
--save-cookies save Mozilla cookies.txt to FILE
--keep-session-cookies
 include session cookies when saving cookies to file
--post-data use POST for all requests with query STRING
--post-file use POST for all requests with query in FILE
--content-disposition
 use filename given in Content-Disposition header
--content-on-error
 keep error pages
--http-compression
 request servers to use HTTP compression
--html-parser

select HTML parsing library and strategy

Possible choices: libxml2-lxml, html5lib

--link-extractors

specify which link extractors to use

Possible choices: css, html, javascript

--escaped-fragment
 rewrite links with hash fragments to escaped fragments
--strip-session-id
 remove session ID tokens from links
--secure-protocol

specify the version of the SSL protocol to use

Possible choices: SSLv3, TLSv1, TLSv1.1, TLSv1.2, auto

--https-only download only HTTPS URLs
--no-check-certificate
 don’t validate SSL server certificates
--no-strong-crypto
 don’t use secure protocols/ciphers
--certificate use FILE containing the local client certificate
--certificate-type

Undocumented

Possible choices: PEM

--private-key use FILE containing the local client private key
--private-key-type

Undocumented

Possible choices: PEM

--ca-certificate
 load and use CA certificate bundle from FILE
--ca-directory load and use CA certificates from DIR
--no-use-internal-ca-certs
 don’t use CA certificates included with Wpull
--random-file use data from FILE to seed the SSL PRNG
--edg-file connect to entropy gathering daemon using socket FILE
--ftp-user username for FTP login
--ftp-password password for FTP login
--no-remove-listing
 keep directory file listings
--no-glob don’t use filename glob patterns on FTP URLs
--preserve-permissions
 apply server’s Unix file permissions on downloaded files
--retr-symlinks

if disabled, preserve symlinks and run with security risks

Possible choices: yes, on, 1, off, no, 0

--warc-file save WARC file to filename prefixed with FILENAME
--warc-append append instead of overwrite the output WARC file
--warc-header include STRING in WARC file metadata
--warc-max-size
 write sequential WARC files sized about NUMBER bytes
--warc-move move WARC files to DIRECTORY as they complete
--warc-cdx write CDX file along with the WARC file
--warc-dedup write revisit records using digests in FILE
--no-warc-compression
 do not compress the WARC file
--no-warc-digests
 do not compute and save SHA1 hash digests
--no-warc-keep-log
 do not save a log into the WARC file
--warc-tempdir use temporary DIRECTORY for preparing WARC files
-r, --recursive
 follow links and download them
-l, --level limit recursion depth to NUMBER
--delete-after download files temporarily and delete them after
-k, --convert-links
 rewrite links in files that point to local files
-K, --backup-converted
 save original files before converting their links
-p, --page-requisites
 download objects embedded in pages
--page-requisites-level
 limit page-requisites recursion depth to NUMBER
--sitemaps download Sitemaps to discover more links
-A, --accept download only files with suffix in LIST
-R, --reject don’t download files with suffix in LIST
--accept-regex download only URLs matching REGEX
--reject-regex don’t download URLs matching REGEX
--regex-type

use regex TYPE

Possible choices: pcre

-D, --domains download only from LIST of hostname suffixes
--exclude-domains
 don’t download from LIST of hostname suffixes
--hostnames download only from LIST of hostnames
--exclude-hostnames
 don’t download from LIST of hostnames
--follow-ftp follow links to FTP sites
--follow-tags follow only links contained in LIST of HTML tags
--ignore-tags don’t follow links contained in LIST of HTML tags
-H, --span-hosts
 follow links and page requisites to other hostnames
--span-hosts-allow

selectively span hosts for resource types in LIST

Possible choices: page-requisites, linked-pages

-L, --relative follow only relative links
-I, --include-directories
 download only paths in LIST
--trust-server-names
 use the last given URL for filename during redirects
-X, --exclude-directories
 don’t download paths in LIST
-np, --no-parent
 don’t follow to parent directories on URL path
--no-strong-redirects
 don’t implicitly allow span hosts for redirects
--proxy-server run HTTP proxy server for capturing requests
--proxy-server-address
 bind the proxy server to ADDRESS
--proxy-server-port
 bind the proxy server port to PORT
--phantomjs use PhantomJS for loading dynamic pages
--phantomjs-exe
 path of PhantomJS executable
--phantomjs-max-time
 maximum duration of PhantomJS session
--phantomjs-scroll
 scroll the page up to NUM times
--phantomjs-wait
 wait SEC seconds between page interactions
--no-phantomjs-snapshot
 don’t take dynamic page snapshots
--no-phantomjs-smart-scroll
 always scroll the page to maximum scroll count option
--youtube-dl use youtube-dl for downloading videos
--youtube-dl-exe
 path of youtube-dl executable

Defaults may differ depending on the operating system. Use --help to see them.

This is only a programmatically generated listing from the program. In most cases, you can follow Wget’s documentation for these options. Wpull follows Wget’s behavior, so please check Wget’s online documentation and resources before asking questions.

Differences between Wpull and Wget

In most cases, Wpull can be substituted with Wget easily. However, some options may not be implemented yet. This section describes the reasons for option differences.

Missing in Wpull

  • --background
  • --execute
  • --config
  • --spider
  • --ignore-case
  • --ask-password
  • --unlink
  • --method
  • --body-data
  • --body-file
  • --auth-no-challenge: Temporarily on by default, but specifying the option is not yet available. Digest authentication is not yet supported.
  • --no-passive-ftp
  • --mirror
  • --strict-comments: No plans for support of this option.
  • --regex-type=posix: No plans to support posix regex.
  • Features greater than Wget 1.15.

Missing in Wget

  • --plugin-script: This provides scripting hooks.
  • --plugin-args
  • --database: Enables the use of the on-disk database.
  • --database-uri
  • --concurrent: Allows changing the number of downloads that happen at once.
  • --debug-console-port
  • --debug-manhole
  • --ignore-fatal-errors
  • --monitor-disk: Avoids filling the disk.
  • --monitor-memory
  • --very-quiet
  • --ascii-print: Force replaces Unicode text with escaped values for environments that are ASCII only.
  • --http-proxy
  • --https-proxy
  • --proxy-domains
  • --proxy-exclude-domains
  • --proxy-hostnames
  • --proxy-exclude-hostnames
  • --retry-dns-error: Wget considers DNS errors as non-recoverable.
  • --session-timeout: Aborts downloads that never finish, such as infinite MP3 streams.
  • --no-skip-getaddrinfo
  • --no-robots: Wpull is designed for archiving.
  • --http-compression (gzip, deflate, & raw deflate)
  • --html-parser: HTML parsing libraries have many trade-offs. Pick any two: small, fast, reliable.
  • --link-extractors
  • --escaped-fragment: Try to force HTML rendering instead of Javascript.
  • --strip-session-id
  • --no-strong-crypto
  • --no-use-internal-ca-certs
  • --warc-append
  • --warc-move: Move WARC files out of the way for resuming a crashed crawl.
  • --page-requisites-level: Prevents infinite downloading of misconfigured server resources, such as HTML served under an image.
  • --sitemaps: Discover more URLs.
  • --hostnames: Wget simply matches hostname endings when using --domains instead of matching each part of the hostname.
  • --exclude-hostnames
  • --span-hosts-allow: Allow fetching things such as images hosted on another domain.
  • --no-strong-redirects
  • --proxy-server
  • --proxy-server-address
  • --proxy-server-port
  • --phantomjs
  • --phantomjs-exe
  • --phantomjs-max-time
  • --phantomjs-scroll
  • --phantomjs-wait
  • --no-phantomjs-snapshot
  • --no-phantomjs-smart-scroll
  • --youtube-dl
  • --youtube-dl-exe

Help

Frequently Asked Questions

What does it mean by “Wget-compatible”?

It means that Wpull behaves similarly to Wget, but the internal machinery that powers Wpull is completely different from Wget.

What advantages does Wpull offer over Wget?

The motivation for the development of Wpull is to find a replacement for Wget that does not store URLs in memory and is scriptable.

Wpull supports an on-disk database so memory requirements remain constant. Wget stores URLs only in memory, so it will eventually run out of memory if you want to crawl millions of URLs at once.

Another motivation is to provide hooks that accept/reject URLs during the crawl.

What advantages does Wget offer over Wpull?

Wget is much more mature and stable. With many developers working on Wget, bug fixes and features arrive faster.

Wget is also written in C, which can handle text much faster. Wpull is written in Python, which was not designed for blazingly fast data processing, so Wpull can be slow when processing large documents.

How can I change things while it is running? / Is there a GUI or web interface to make things easier?

Wpull does not currently offer a user-friendly interface for making changes while it runs. However, please check out https://github.com/ludios/grab-site, which is a web interface built on top of Wpull.

Wpull is giving an error or not performing correctly.

Check that you have the options correct. In most cases, it is a misunderstanding of Wget options.

Otherwise if Wpull is not doing what you want, please visit the issue tracker and see if your issue is there. If not, please inform the developers by creating a new issue.

When you open a new issue, GitHub provides a link to the guidelines document. Please read it to learn how to file a good bug report.

How can I help the development of Wpull? What are the development goals?

Please visit the GitHub repository at https://github.com/chfoo/wpull. From there, you can take a look at:

  • The Contributing file for specific instructions on how to help
  • The issue tracker for current bugs and features
  • The Wiki for the roadmap of the project such as goals and statuses
  • And the code, of course

How can I chat or ask a question?

For chatting and quick questions, please visit the “unofficial” IRC channel: #archiveteam-bs on EFNet.

Alternatively if the discussion is lengthy, please use the issue tracker as described above. As a courtesy, if your question is answered on the issue tracker, please close the issue to mark your question as solved.

We highly prefer that you use IRC or the issue tracker. But email is also available: chris.foo@gmail.com

Plugin Scripting Hooks

Wpull’s scripting support is modelled after alard’s Wget with Lua hooks.

Scripts are installed using the YAPSY plugin architecture. To create your plugin script, subclass wpull.application.plugin.WpullPlugin and load it with the --plugin-script option.

The plugin interface provides two types of callbacks: hooks and events.

Hook

Hooks change the behavior of the program. A callback registered to a hook is required to provide a return value, typically one of wpull.application.hook.Actions. Only one callback may be registered to a hook.

To register your callback, decorate your callback with wpull.application.plugin.hook().
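
For example, here is a minimal sketch of a hook callback. It uses the exit_status hook; the callback signature (app_session, exit_code) follows the AppStopTask.plugin_exit_status interface listed under Interfaces below:

from wpull.application.plugin import WpullPlugin, PluginFunctions, hook

class ExitStatusPlugin(WpullPlugin):
    @hook(PluginFunctions.exit_status)
    def exit_status(self, app_session, exit_code):
        # Treat server errors (exit status 8) as success.
        return 0 if exit_code == 8 else exit_code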

Event

Events are points in the program that, when reached, notify registered listeners.

To register your callback, decorate your callback with wpull.application.plugin.event().
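
For example, here is a minimal sketch of an event listener. It uses the dequeued_url event; the callback signature (url_info, record_info) follows the URLTableHookWrapper.dequeued_url interface listed under Interfaces below:

from wpull.application.plugin import WpullPlugin, PluginFunctions, event

class QueueLogPlugin(WpullPlugin):
    @event(PluginFunctions.dequeued_url)
    def dequeued_url(self, url_info, record_info):
        # Print each URL as it is taken off the queue.
        print('Dequeued:', url_info.url)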

Interfaces

The global hooks and events constants are located at wpull.application.plugin.PluginFunctions.

PluginFunctions.accept_url
hook Interface: FetchRule.plugin_accept_url
PluginFunctions.dequeued_url
event Interface: URLTableHookWrapper.dequeued_url
PluginFunctions.exit_status
hook Interface: AppStopTask.plugin_exit_status
PluginFunctions.finishing_statistics
event Interface: StatsStopTask.plugin_finishing_statistics
PluginFunctions.get_urls
event Interface: ProcessingRule.plugin_get_urls
PluginFunctions.handle_error
hook Interface: ResultRule.plugin_handle_error
PluginFunctions.handle_pre_response
hook Interface: ResultRule.plugin_handle_pre_response
PluginFunctions.handle_response
hook Interface: ResultRule.plugin_handle_response
PluginFunctions.queued_url
event Interface: URLTableHookWrapper.queued_url
PluginFunctions.resolve_dns
hook Interface: Resolver.resolve_dns
PluginFunctions.resolve_dns_result
event Interface: Resolver.resolve_dns_result
PluginFunctions.wait_time
hook Interface: ResultRule.plugin_wait_time

Example

Here is an example Python script. It:

  • Prints hello on start up
  • Refuses to download anything with the word “dog” in the URL
  • Scrapes URLs on a hypothetical homepage
  • Stops the program execution when the server returns HTTP 429

import re

from wpull.application.hook import Actions
from wpull.application.plugin import WpullPlugin, PluginFunctions, hook, event
from wpull.pipeline.session import ItemSession


class MyExamplePlugin(WpullPlugin):
    def activate(self):
        super().activate()
        print('Hello world!')

    def deactivate(self):
        super().deactivate()
        print('Goodbye world!')

    @hook(PluginFunctions.accept_url)
    def my_accept_func(self, item_session: ItemSession, verdict: bool, reasons: dict) -> bool:
        return 'dog' not in item_session.request.url

    @event(PluginFunctions.get_urls)
    def my_get_urls(self, item_session: ItemSession):
        if item_session.request.url_info.path != '/':
            return

        # content() returns bytes; decode it for the str pattern
        matches = re.finditer(
            r'<div id="profile-(\w+)"',
            item_session.response.body.content().decode('utf-8', errors='replace')
        )
        for match in matches:
            url = 'http://example.com/profile.php?username={}'.format(
                match.group(1)
            )
            item_session.add_child_url(url)

    @hook(PluginFunctions.handle_response)
    def my_handle_response(self, item_session: ItemSession):
        if item_session.response.response_code() == 429:
            return Actions.STOP

        return Actions.NORMAL

API

Wpull was designed as a command line program and most users do not need to read this section. However, you may be using the scripting hook interface or you may want to reuse a component.

Since Wpull is generally not a library, API backwards compatibility is provided on a best-effort basis; there is no guarantee on whether public or private functions will remain the same. This rule does not include the scripting hook interface which is designed for backwards compatibility.

All documented classes and functions are listed here. Not all members are documented yet. Some members, such as the backported modules, are not documented here.

If the documentation is not sufficient, please take a look at the source code. Suggestions and improvements are welcomed.

Note

The API is not thread-safe. It is intended to be run asynchronously with Asyncio.

Many functions also are decorated with the asyncio.coroutine() decorator. For more information, see https://docs.python.org/3/library/asyncio.html.

wpull Package

application Module

application.app Module

Application main interface.

class wpull.application.app.Application(pipeline_series: wpull.pipeline.pipeline.PipelineSeries)[source]

Bases: wpull.application.hook.HookableMixin

Default non-interactive application user interface.

This class manages process signals and displaying warnings.

ERROR_CODE_MAP = OrderedDict([(<class 'wpull.errors.AuthenticationError'>, 6), (<class 'wpull.errors.ServerError'>, 8), (<class 'wpull.errors.ProtocolError'>, 7), (<class 'wpull.errors.SSLVerificationError'>, 5), (<class 'wpull.errors.DNSNotFound'>, 4), (<class 'wpull.errors.ConnectionRefused'>, 4), (<class 'wpull.errors.NetworkError'>, 4), (<class 'OSError'>, 3)])

Mapping of error types to exit status.

EXPECTED_EXCEPTIONS = (<class 'wpull.errors.ServerError'>, <class 'wpull.errors.ProtocolError'>, <class 'wpull.errors.SSLVerificationError'>, <class 'wpull.errors.DNSNotFound'>, <class 'wpull.errors.ConnectionRefused'>, <class 'wpull.errors.NetworkError'>, <class 'OSError'>, <class 'OSError'>, <class 'wpull.application.hook.HookStop'>, <class 'StopIteration'>, <class 'SystemExit'>, <class 'KeyboardInterrupt'>)

Exception classes that are not crashes.

class Event[source]

Bases: enum.Enum

Application.exit_code
Application.run()[source]
Application.run_sync() → int[source]

Run the application.

This function is blocking.

Returns:The exit status.
Return type:int
Application.setup_signal_handlers()[source]

Set up Ctrl+C and SIGTERM handlers.

Application.stop()[source]
Application.update_exit_code(code: int)[source]

Set the exit code if it is more serious than before.

Parameters:code – The exit code.
class wpull.application.app.ApplicationState[source]

Bases: enum.Enum

application.builder Module

Application support.

class wpull.application.builder.Builder(args, unit_test=False)[source]

Bases: object

Application builder.

Parameters:args – Options from argparse.ArgumentParser
build() → wpull.application.app.Application[source]

Put the application together.

build_and_run()[source]

Build and run the application.

Returns:The exit status.
Return type:int
factory

Return the Factory.

Returns:A factory.Factory instance.
Return type:Factory
get_stderr()[source]

Return stderr or something else if under unit testing.

application.factory Module

Instance creation and management.

class wpull.application.factory.Factory(class_map=None)[source]

Bases: collections.abc.Mapping, object

Allows selection of classes and keeps track of instances.

This class behaves like a mapping. Keys are names of classes and values are instances.

class_map

A mapping of names to class types.

instance_map

A mapping of names to instances.

is_all_initialized()[source]

Return whether all the instances have been initialized.

Returns:bool
new(name, *args, **kwargs)[source]

Create an instance.

Parameters:
  • name (str) – The name of the class
  • args – The arguments to pass to the class.
  • kwargs – The keyword arguments to pass to the class.
Returns:

instance

set(name, class_)[source]

Set the callable or class to be used.

Parameters:
  • name (str) – The name of the class.
  • class – The class or a callable factory function.
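
As a rough sketch of the intended use, based on the methods listed above (Engine is a hypothetical stand-in class):

from wpull.application.factory import Factory

class Engine:
    def __init__(self, concurrency=1):
        self.concurrency = concurrency

factory = Factory()
factory.set('Engine', Engine)  # register the class under a name
engine = factory.new('Engine', concurrency=10)  # instantiate and track it
assert factory['Engine'] is engine  # mapping access returns the instance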

application.hook Module

Python and Lua scripting support.

See Plugin Scripting Hooks for an introduction.

class wpull.application.hook.Actions[source]

Bases: enum.Enum

Actions for handling responses and errors.

NORMAL

normal

Use Wpull’s original behavior.

RETRY

retry

Retry this item (as if an error has occurred).

FINISH

finish

Consider this item as done; don’t do any further processing on it.

STOP

stop

Raises HookStop to stop the Engine from running.

class wpull.application.hook.EventDispatcher[source]

Bases: collections.abc.Mapping

add_listener(name: str, callback)[source]
is_registered(name: str) → bool[source]
notify(name: str, *args, **kwargs)[source]
register(name: str)[source]
remove_listener(name: str, callback)[source]
unregister(name: str)[source]
exception wpull.application.hook.HookAlreadyConnectedError[source]

Bases: ValueError

A callback is already connected to the hook.

exception wpull.application.hook.HookDisconnected[source]

Bases: RuntimeError

No callback is connected.

class wpull.application.hook.HookDispatcher(event_dispatcher_transclusion: typing.Union=None)[source]

Bases: collections.abc.Mapping

Dynamic callback hook system.

call(name: str, *args, **kwargs)[source]

Invoke the callback.

call_async(name: str, *args, **kwargs)[source]

Invoke the callback.

connect(name, callback)[source]

Add callback to hook.

disconnect(name: str)[source]

Remove callback from hook.

is_connected(name: str) → bool[source]

Return whether the hook is connected.

is_registered(name: str) → bool[source]
register(name: str)[source]

Register hooks that can be connected.

unregister(name: str)[source]

Unregister hook.

exception wpull.application.hook.HookStop[source]

Bases: Exception

Stop the engine.

Raise this exception as a more graceful alternative to sys.exit().

class wpull.application.hook.HookableMixin[source]

Bases: object

connect_plugin(plugin: wpull.application.plugin.WpullPlugin)[source]

application.main Module

wpull.application.main.main(exit=True, install_tornado_bridge=True, use_signals=True)[source]

application.options Module

Program options.

class wpull.application.options.AppArgumentParser(*args, real_exit=True, **kwargs)[source]

Bases: argparse.ArgumentParser

An Argument Parser that builds up the application options.

classmethod comma_choice_list(string)[source]

Convert a comma separated string to CommaChoiceListArgs.

classmethod comma_list(string)[source]

Convert a comma separated string to list.

exit(status=0, message=None)[source]
classmethod get_argv_encoding(argv)[source]
classmethod int_0_inf(string)[source]

Convert string to int.

If inf is supplied, it returns 0.

classmethod int_bytes(string)[source]

Convert string describing size to int.

parse_args(args=None, namespace=None)[source]
class wpull.application.options.AppHelpFormatter(prog, indent_increment=2, max_help_position=24, width=None)[source]

Bases: argparse.HelpFormatter

class wpull.application.options.CommaChoiceListArgs[source]

Bases: frozenset

Specialized frozenset.

This class overrides the __contains__ function to allow the use of the in operator for ArgumentParser’s choices checking for comma separated lists. The function behaves differently only when the objects compared are CommaChoiceListArgs.
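
A small sketch of how the comma-separated helpers above might be used; it assumes comma_list returns a plain list of strings and comma_choice_list returns CommaChoiceListArgs:

from wpull.application.options import AppArgumentParser

assert AppArgumentParser.comma_list('css,html') == ['css', 'html']

choices = AppArgumentParser.comma_choice_list('linked-pages,page-requisites')
assert 'linked-pages' in choices  # frozenset membership via __contains__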

application.plugin Module

class wpull.application.plugin.InterfaceRegistry[source]

Bases: collections.abc.Mapping

register(name: typing.Any, interface: typing.Any, category: wpull.application.plugin.PluginFunctionCategory)[source]
wpull.application.plugin.PluginClientFunctionInfo

alias of _PluginClientFunctionInfo

class wpull.application.plugin.PluginFunctionCategory[source]

Bases: enum.Enum

class wpull.application.plugin.PluginFunctions[source]

Bases: enum.Enum

class wpull.application.plugin.WpullPlugin[source]

Bases: yapsy.IPlugin.IPlugin

get_plugin_functions() → typing.Iterator[source]
should_activate() → bool[source]
wpull.application.plugin.event(name: typing.Any)[source]
wpull.application.plugin.event_interface(name: typing.Any, interface_registry: wpull.application.plugin.InterfaceRegistry=<wpull.application.plugin.InterfaceRegistry object>)[source]
wpull.application.plugin.hook(name: typing.Any)[source]
wpull.application.plugin.hook_interface(name: typing.Any, interface_registry: wpull.application.plugin.InterfaceRegistry=<wpull.application.plugin.InterfaceRegistry object>)[source]

application.plugins Module

application.plugins.arg_warning.plugin Module

application.plugins.debug_console.plugin Module

application.plugins.download_progress.plugin Module

application.plugins.server_response.plugin Module

application.tasks Module

application.tasks.conversion Module

class wpull.application.tasks.conversion.LinkConversionSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.conversion.LinkConversionTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.application.tasks.conversion.QueuedFileSession)[source]
class wpull.application.tasks.conversion.QueuedFileSession(app_session: wpull.pipeline.app.AppSession, file_id: int, url_record: wpull.pipeline.item.URLRecord)[source]

Bases: object

class wpull.application.tasks.conversion.QueuedFileSource(app_session: wpull.pipeline.app.AppSession)[source]

Bases: wpull.pipeline.pipeline.ItemSource

get_item() → typing.Union[source]

application.tasks.database Module

class wpull.application.tasks.database.DatabaseSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.database.InputURLTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

application.tasks.download Module

class wpull.application.tasks.download.BackgroundAsyncTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.session.ItemSession)[source]
class wpull.application.tasks.download.CheckQuotaTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.session.ItemSession)[source]
class wpull.application.tasks.download.ClientSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.download.CoprocessorSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.download.ParserSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.download.ProcessTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.session.ItemSession)[source]
class wpull.application.tasks.download.ProcessorSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.download.ProxyServerSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

Build MITM proxy server.

application.tasks.log Module

class wpull.application.tasks.log.LoggingSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.log.LoggingShutdownTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

application.tasks.network Module

class wpull.application.tasks.network.NetworkSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

application.tasks.plugin Module

class wpull.application.tasks.plugin.PluginLocator(directories, paths)[source]

Bases: yapsy.IPluginLocator.IPluginLocator

gatherCorePluginInfo(directory, filename)[source]
locatePlugins()[source]
class wpull.application.tasks.plugin.PluginSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

application.tasks.resmon Module

class wpull.application.tasks.resmon.ResmonSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.resmon.ResmonSleepTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.session.ItemSession)[source]

application.tasks.rule Module

class wpull.application.tasks.rule.URLFiltersPostURLImportSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.rule.URLFiltersSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

application.tasks.shutdown Module

class wpull.application.tasks.shutdown.AppStopTask[source]

Bases: wpull.pipeline.pipeline.ItemTask, wpull.application.hook.HookableMixin

static plugin_exit_status(app_session: wpull.pipeline.app.AppSession, exit_code: int) → int[source]

Return the program exit status code.

Exit codes are values from errors.ExitStatus.

Parameters:exit_code – The exit code Wpull wants to return.
Returns:The exit code that Wpull will return.
Return type:int
process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.shutdown.BackgroundAsyncCleanupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.shutdown.CookieJarTeardownTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

application.tasks.sslcontext Module

class wpull.application.tasks.sslcontext.SSLContextTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

application.tasks.stats Module

class wpull.application.tasks.stats.StatsStartTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.stats.StatsStopTask[source]

Bases: wpull.pipeline.pipeline.ItemTask, wpull.application.hook.HookableMixin

static plugin_finishing_statistics(app_session: wpull.pipeline.app.AppSession, statistics: wpull.stats.Statistics)[source]

Callback containing final statistics.

Parameters:
  • start_time (float) – timestamp when the engine started
  • end_time (float) – timestamp when the engine stopped
  • num_urls (int) – number of URLs downloaded
  • bytes_downloaded (int) – size of files downloaded in bytes
process(session: wpull.pipeline.app.AppSession)[source]

application.tasks.warc Module

class wpull.application.tasks.warc.WARCRecorderSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.warc.WARCRecorderTeardownTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]
class wpull.application.tasks.warc.WARCVisitsTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

Populate the visits from the CDX into the URL table.

application.tasks.writer Module

class wpull.application.tasks.writer.FileWriterSetupTask[source]

Bases: wpull.pipeline.pipeline.ItemTask

process(session: wpull.pipeline.app.AppSession)[source]

body Module

Request and response payload.

class wpull.body.Body(file=None, directory=None, hint='lone_body')[source]

Bases: object

Represents the document/payload of a request or response.

This class is a wrapper around a file object. Methods are forwarded to the underlying file object.

file

file

The file object.

Parameters:
  • file (file, optional) – Use the given file as the file object.
  • directory (str) – If file is not given, use directory for a new temporary file.
  • hint (str) – If file is not given, use hint as a filename infix.
content()[source]

Return the content of the file.

If this function is invoked, the contents of the entire file are read and cached.

Returns:The entire content of the file.
Return type:bytes
size()[source]

Return the size of the file.

to_dict()[source]

Convert the body to a dict.

Returns:The items are:
  • filename (string, None): The path of the file.
  • length (int, None): The size of the file.
Return type:dict
wpull.body.is_seekable(file)[source]
wpull.body.new_temp_file(directory=None, hint='')[source]

Return a new temporary file.

cache Module

Caching.

class wpull.cache.BaseCache[source]

Bases: collections.abc.Mapping, object

clear()[source]

Remove all items in cache.

wpull.cache.Cache

alias of LRUCache

class wpull.cache.CacheItem(key, value, time_to_live=None, access_time=None)[source]

Bases: object

Info about an item in the cache.

Parameters:
  • key – The key
  • value – The value
  • time_to_live – The time in seconds of how long to keep the item
  • access_time – The timestamp of the last use of the item
expire_time

When the item expires.

class wpull.cache.FIFOCache(max_items=None, time_to_live=None)[source]

Bases: wpull.cache.BaseCache

First in first out object cache.

Parameters:
  • max_items (int) – The maximum number of items to keep.
  • time_to_live (float) – Discard items after time_to_live seconds.

Reusing a key to update a value will not affect the expire time of the item.

clear()[source]
trim()[source]

Remove items that are expired or exceed the max size.

class wpull.cache.LRUCache(max_items=None, time_to_live=None)[source]

Bases: wpull.cache.FIFOCache

Least recently used object cache.

Parameters:
  • max_items – The maximum number of items to keep
  • time_to_live – The time in seconds of how long to keep the item
touch(key)[source]
wpull.cache.total_ordering(obj)

collections Module

Data structures.

class wpull.collections.FrozenDict(orig_dict)[source]

Bases: collections.abc.Mapping, collections.abc.Hashable

Immutable mapping wrapper.

hash_cache
orig_dict
class wpull.collections.LinkedList[source]

Bases: object

Doubly linked list.

map

dict

A mapping of values to nodes.

head

LinkedListNode

The first node.

tail

LinkedListNode

The last node.

append(value)[source]
appendleft(value)[source]
clear()[source]
index(value)[source]
pop()[source]
popleft()[source]
remove(value)[source]
remove_node(node)[source]
class wpull.collections.LinkedListNode(value, head=None, tail=None)[source]

Bases: object

A node in a LinkedList.

value

Any value.

head

LinkedListNode

The node in front.

tail

LinkedListNode

The node in back.

head

Add a node to the head.

Add a node to the tail.

tail

Remove this node and link any head or tail.

value
class wpull.collections.OrderedDefaultDict(default_factory=None, *args, **kwargs)[source]

Bases: collections.OrderedDict

An ordered default dict.

http://stackoverflow.com/a/6190500/1524507

copy()[source]

converter Module

Document content post-processing.

class wpull.converter.BaseDocumentConverter[source]

Bases: object

Base class for classes that convert links within a document.

convert(input_filename, output_filename, base_url=None)[source]
class wpull.converter.BatchDocumentConverter(html_parser, element_walker, url_table, backup=False)[source]

Bases: object

Convert all documents in URL table.

Parameters:
  • url_table – An instance of database.URLTable.
  • backup (bool) – Whether back up files are created.
convert_all()[source]

Convert all links in URL table.

convert_by_record(url_record)[source]

Convert using given URL Record.

class wpull.converter.CSSConverter(url_table)[source]

Bases: wpull.scraper.css.CSSScraper, wpull.converter.BaseDocumentConverter

CSS converter.

convert(input_filename, output_filename, base_url=None)[source]
convert_text(text, base_url=None)[source]
get_new_url(url, base_url=None)[source]
class wpull.converter.HTMLConverter(html_parser, element_walker, url_table)[source]

Bases: wpull.scraper.html.HTMLScraper, wpull.converter.BaseDocumentConverter

HTML converter.

convert(input_filename, output_filename, base_url=None)[source]

cookie Module

HTTP Cookies.

class wpull.cookie.BetterMozillaCookieJar(filename=None, delayload=False, policy=None)[source]

Bases: http.cookiejar.FileCookieJar

MozillaCookieJar that is compatible with Wget/Curl.

It ignores file header checks and supports session cookies.

header = '# Netscape HTTP Cookie File\n# http://curl.haxx.se/rfc/cookie_spec.html\n# This is a generated file! Do not edit.\n\n'
magic_re = re.compile('.')
save(filename=None, ignore_discard=False, ignore_expires=False)[source]
class wpull.cookie.DeFactoCookiePolicy(*args, **kwargs)[source]

Bases: http.cookiejar.DefaultCookiePolicy

Cookie policy that limits the content and length of the cookie.

Parameters:cookie_jar – The CookieJar instance.

This policy class is not designed to be shared between CookieJar instances.

cookie_length(domain)[source]

Return approximate length of all cookie key-values for a domain.

count_cookies(domain)[source]

Return the number of cookies for the given domain.

set_ok(cookie, request)[source]

cookiewrapper Module

Wrappers that wrap instances to Python standard library.

class wpull.cookiewrapper.CookieJarWrapper(cookie_jar, save_filename=None, keep_session_cookies=False)[source]

Bases: object

Wraps a CookieJar.

Parameters:
  • cookie_jar – An instance of http.cookiejar.CookieJar.
  • save_filename (str, optional) – A filename to save the cookies.
  • keep_session_cookies (bool) – If True, session cookies are kept when saving to file.

Wrapped add_cookie_header.

Parameters:
  • request – An instance of http.request.Request.
  • referrer_host (str) – A hostname or IP address of the referrer URL.
close()[source]

Save the cookie jar if needed.

cookie_jar

Return the wrapped Cookie Jar.

extract_cookies(response, request, referrer_host=None)[source]

Wrapped extract_cookies.

Parameters:
class wpull.cookiewrapper.HTTPResponseInfoWrapper(response)[source]

Bases: object

Wraps a HTTP Response.

Parameters:response – An instance of http.request.Response
info()[source]

Return the header fields as a Message:

Returns:An instance of email.message.Message. If Python 2, returns an instance of mimetools.Message.
Return type:Message
wpull.cookiewrapper.convert_http_request(request, referrer_host=None)[source]

Convert a HTTP request.

Parameters:
  • request – An instance of http.request.Request.
  • referrer_host (str) – The referring hostname or IP address.
Returns:

An instance of urllib.request.Request

Return type:

Request

database Module

Storage for tracking URLs.

database.base Module

Base table class.

wpull.database.base.AddURLInfo

alias of _AddURLInfo

class wpull.database.base.BaseURLTable[source]

Bases: object

URL table.

add_many(new_urls: typing.Iterator) → typing.Iterator[source]

Add the URLs to the table.

Parameters:new_urls – URLs to be added.
Returns:The URLs added. Useful for tracking duplicates.
add_one(url: str, url_properties: typing.Union=None, url_data: typing.Union=None)[source]

Add a single URL to the table.

Parameters:
  • url – The URL to be added
  • url_properties – Additional values to be saved
  • url_data – Additional data to be saved
add_visits(visits)[source]

Add visited URLs from CDX file.

Parameters:visits (iterable) – An iterable of items. Each item is a tuple containing a URL, the WARC ID, and the payload digest.
check_in(url: str, new_status: wpull.pipeline.item.Status, increment_try_count: bool=True, url_result: typing.Union=None)[source]

Update record for processed URL.

Parameters:
  • url – The URL.
  • new_status – Update the item status to new_status.
  • increment_try_count – Whether to increment the try counter for the URL.
  • url_result – Additional values.
check_out(filter_status: wpull.pipeline.item.Status, filter_level: typing.Union=None) → wpull.pipeline.item.URLRecord[source]

Find a URL, mark it in progress, and return it.

Parameters:
  • filter_status – Gets first item with given status.
  • filter_level – Gets item with filter_level or lower.
Raises:

NotFound

close()[source]

Run any clean-up actions and close the table.

contains(url: str)[source]

Return whether the URL is in the table.

convert_check_in(file_id: int, status: wpull.pipeline.item.Status)[source]
convert_check_out() -> (<class 'int'>, <class 'wpull.pipeline.item.URLRecord'>)[source]
count() → int[source]

Return the number of URLs in the table.

This call may be expensive.

get_all() → typing.Iterator[source]

Return all URLRecord.

get_hostnames()[source]

Return list of hostnames

get_one(url: str) → wpull.pipeline.item.URLRecord[source]

Return a URLRecord for the URL.

Raises:NotFound
get_revisit_id(url, payload_digest)[source]

Return the WARC ID corresponding to the visit.

Returns:str, None
get_root_url_todo_count() → int[source]
release()[source]

Mark any in_progress URLs to todo status.

remove_many(urls)[source]

Remove the URLs from the database.

remove_one(url)[source]

Remove a URL from the database.

update_one(url, **kwargs)[source]

Arbitrarily update values for a URL.

exception wpull.database.base.DatabaseError[source]

Bases: Exception

Any database error.

exception wpull.database.base.NotFound[source]

Bases: wpull.database.base.DatabaseError

Item not found in the table.
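
A minimal sketch of the check-out/check-in workflow, using the SQLite implementation from the database.sqltable module below; the Status values (todo, done) are assumptions based on wpull.pipeline.item.Status:

from wpull.database.sqltable import SQLiteURLTable
from wpull.pipeline.item import Status

table = SQLiteURLTable(path='crawl.db')
table.add_one('http://example.com/')     # queue a URL
record = table.check_out(Status.todo)    # take it and mark it in progress
table.check_in(record.url, Status.done)  # mark it completed
table.close()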

database.sqlmodel Module

Database SQLAlchemy model.

wpull.database.sqlmodel.DBBase

alias of Base

class wpull.database.sqlmodel.QueuedURL(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

filename

Local filename of the item.

id
inline_level

Depth of the page requisite object. 0 is the object, 1 is the object’s dependency, etc.

level

Recursive depth of the item. 0 is root, 1 is child of root, etc.

Expected content type of extracted link.

parent_url

A descriptor that presents a read/write view of an object attribute.

parent_url_string
parent_url_string_id

Optional referral URL

post_data

Additional percent-encoded data for POST.

priority

Priority of item.

root_url

A descriptor that presents a read/write view of an object attribute.

root_url_string
root_url_string_id

Optional root URL

status

Status of the completion of the item.

status_code

HTTP status code or FTP reply code.

to_plain() → wpull.pipeline.item.URLRecord[source]
try_count

Number of attempts made in order to process the item.

url

A descriptor that presents a read/write view of an object attribute.

url_string
url_string_id

Target URL to fetch

classmethod watch_urls_inserted(session)[source]
class wpull.database.sqlmodel.URLString(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Table containing the URL strings.

The URL references this table.

classmethod add_urls(session, urls: typing.Iterable)[source]
id
url
class wpull.database.sqlmodel.WARCVisit(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

Standalone table for --cdx-dedup feature.

classmethod add_visits(session, visits)[source]
classmethod get_revisit_id(session, url, payload_digest)[source]
payload_digest
url
warc_id
class wpull.database.sqlmodel.Hostname(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

hostname
id
class wpull.database.sqlmodel.QueuedFile(**kwargs)[source]

Bases: sqlalchemy.ext.declarative.api.Base

id
queued_url
queued_url_id
status

database.sqltable Module

SQLAlchemy table implementations.

class wpull.database.sqltable.BaseSQLURLTable[source]

Bases: wpull.database.base.BaseURLTable

add_many(new_urls)[source]
add_visits(visits)[source]
check_in(url, new_status, increment_try_count=True, url_result=None)[source]
check_out(filter_status, level=None)[source]
convert_check_in(file_id, status)[source]
convert_check_out()[source]
count()[source]
get_all()[source]
get_hostnames()[source]
get_one(url)[source]
get_revisit_id(url, payload_digest)[source]
get_root_url_todo_count()[source]
release()[source]
remove_many(urls)[source]
update_one(url, **kwargs)[source]
class wpull.database.sqltable.SQLiteURLTable(path=':memory:')[source]

Bases: wpull.database.sqltable.BaseSQLURLTable

URL table with SQLite storage.

Parameters:path – A SQLite filename
close()[source]
class wpull.database.sqltable.GenericSQLURLTable(url)[source]

Bases: wpull.database.sqltable.BaseSQLURLTable

URL table using SQLAlchemy without any customizations.

Parameters:url – A SQLAlchemy database URL.
close()[source]
wpull.database.sqltable.URLTable

The default URL table implementation.

alias of SQLiteURLTable

wpull.database.sqltable.convert_dict_enum_values(dict_)[source]

database.wrap Module

URL table wrappers.

class wpull.database.wrap.URLTableHookWrapper(url_table)[source]

Bases: wpull.database.base.BaseURLTable, wpull.application.hook.HookableMixin

URL table wrapper with scripting hooks.

Parameters:url_table – URL table.
url_table

URL table.

add_many(urls)[source]
add_visits(visits)[source]
check_in(url, new_status, increment_try_count=True, url_result=None)[source]
check_out(filter_status, filter_level=None)[source]
close()[source]
convert_check_in(file_id: int, status: wpull.pipeline.item.Status)[source]
convert_check_out()[source]
count()[source]
static dequeued_url(url_info: wpull.url.URLInfo, record_info: wpull.pipeline.item.URLRecord)[source]

Callback fired after a URL was retrieved from the queue.

get_all()[source]
get_hostnames()[source]
get_one(url)[source]
get_revisit_id(url, payload_digest)[source]
get_root_url_todo_count()[source]
queue_count()[source]

Return the number of URLs queued in this session.

static queued_url(url_info: wpull.url.URLInfo)[source]

Callback fired after a URL was put into the queue.

release()[source]
remove_many(urls)[source]
update_one(*args, **kwargs)[source]

debug Module

Debugging utilities.

class wpull.debug.DebugConsoleHandler(application, request, **kwargs)[source]

Bases: tornado.web.RequestHandler

TEMPLATE = '<html>\n <style>\n #commandbox {{\n width: 100%;\n }}\n </style>\n <body>\n <p>Welcome to DEBUG CONSOLE!</p>\n <p><tt>Builder()</tt> instance at <tt>wpull_builder</tt>.</p>\n <form method="post">\n <input id="commandbox" name="command" value="{command}">\n <input type="submit" value="Execute">\n </form>\n <pre>{output}</pre>\n </body>\n </html>\n '
get()[source]
post()[source]

decompression Module

Streaming decompressors.

class wpull.decompression.DeflateDecompressor[source]

Bases: wpull.decompression.SimpleGzipDecompressor

zlib decompressor with raw deflate detection.

This class doesn’t do any special. It only tries regular zlib and then tries raw deflate on the first decompress.

decompress(value)[source]
flush()[source]
class wpull.decompression.GzipDecompressor[source]

Bases: wpull.decompression.SimpleGzipDecompressor

gzip decompressor with gzip header detection.

This class checks if the stream starts with the 2 byte gzip magic number. If it is not present, it returns the bytes unchanged.

decompress(value)[source]
flush()[source]
class wpull.decompression.SimpleGzipDecompressor[source]

Bases: object

Streaming gzip decompressor.

The interface is like that of zlib.decompressobj (without some of the optional arguments), but it understands gzip headers and checksums.

decompress(value)[source]

Decompress a chunk, returning newly-available data.

Some data may be buffered for later processing; flush must be called when there is no more input data to ensure that all data was processed.

flush()[source]

Return any remaining buffered data not yet returned by decompress.

Also checks for errors such as truncated input. No other methods may be called on this object after flush.
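
A short sketch of the streaming interface; the sample input is generated with the standard library's gzip module:

import gzip

from wpull.decompression import GzipDecompressor, gzip_uncompress

compressed = gzip.compress(b'hello world')

decompressor = GzipDecompressor()
data = decompressor.decompress(compressed[:8])  # data may be buffered
data += decompressor.decompress(compressed[8:])
data += decompressor.flush()  # drain buffers and check for truncation
assert data == b'hello world'

# The convenience function documented below does the same in one call:
assert gzip_uncompress(compressed) == b'hello world'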

wpull.decompression.gzip_uncompress(data, truncated=False)[source]

Uncompress gzip data.

Parameters:
  • data (bytes) – The gzip data.
  • truncated (bool) – If True, the decompressor is not flushed.

This is a convenience function.

Returns:The inflated data.
Return type:bytes
Raises:zlib.error

document Module

Document handling.

document.base Module

Document bases.

class wpull.document.base.BaseDocumentDetector[source]

Bases: object

Base class for classes that detect document types.

classmethod is_file(file)[source]

Return whether the reader is likely able to read the file.

Parameters:file – A file object containing the document.
Returns:bool
classmethod is_request(request)[source]

Return whether the request is likely supported.

Parameters:request (http.request.Request) – An HTTP request.
Returns:bool
classmethod is_response(response)[source]

Return whether the response is likely able to be read.

Parameters:response (http.request.Response) – An HTTP response.
Returns:bool
classmethod is_supported(file=None, request=None, response=None, url_info=None)[source]

Given the hints, return whether the document is supported.

Parameters:
  • file – A file object containing the document.
  • request (http.request.Request) – An HTTP request.
  • response (http.request.Response) – An HTTP response.
  • url_info (url.URLInfo) – A URLInfo.
Returns:

If True, the reader should be able to read it.

Return type:

bool

classmethod is_url(url_info)[source]

Return whether the URL is likely to be supported.

Parameters:url_info (url.URLInfo) – A URLInfo.
Returns:bool
class wpull.document.base.BaseExtractiveReader[source]

Bases: object

Base class for document readers that can only extract links.

iter_links(file, encoding=None)[source]

Return links from the file.

Returns:Each item is a str which represents a link.
Return type:iterator
class wpull.document.base.BaseHTMLReader[source]

Bases: object

Base class for document readers for handling SGML-like documents.

iter_elements(file, encoding=None)[source]

Return an iterator of elements found in the document.

Parameters:
  • file – A file object containing the document.
  • encoding (str) – The encoding of the document.
Returns:

Each item is an element from document.htmlparse.element

Return type:

iterator

class wpull.document.base.BaseTextStreamReader[source]

Bases: object

Base class for document readers that filter link and non-link text.

iter_links(file, encoding=None)[source]

Return the links.

This function is a convenience function for calling iter_text() and returning only the links.

iter_text(file, encoding=None)[source]

Return the file text and links.

Parameters:
  • file – A file object containing the document.
  • encoding (str) – The encoding of the document.
Returns:

Each item is a tuple:

  1. str: The text
  2. bool (or truthy value): Whether the text is likely a link. A truthy value may be provided containing additional context about the link.

Return type:

iterator

The links returned are raw text and will require further processing.

wpull.document.base.VeryFalse = <wpull.document.base.VeryFalseType object>

Document is definitely not supported.

class wpull.document.base.VeryFalseType[source]

Bases: object

document.css Module

Stylesheet reader.

class wpull.document.css.CSSReader[source]

Bases: wpull.document.base.BaseDocumentDetector, wpull.document.base.BaseTextStreamReader

Cascading Stylesheet Document Reader.

BUFFER_SIZE = 1048576
IMPORT_URL_PATTERN = '@import\\s*(?:url\\()?[\'"]?([^\\s\'")]{1,500}).*?;'
STREAM_REWIND = 4096
URL_PATTERN = 'url\\(\\s*([\'"]?)(.{1,500}?)(?:\\1)\\s*\\)'
URL_REGEX = re.compile('url\\(\\s*([\'"]?)(.{1,500}?)(?:\\1)\\s*\\)|@import\\s*(?:url\\()?[\'"]?([^\\s\'")]{1,500}).*?;')
classmethod is_file(file)[source]

Return whether the file is likely CSS.

classmethod is_request(request)[source]

Return whether the document is likely to be CSS.

classmethod is_response(response)[source]

Return whether the document is likely to be CSS.

classmethod is_url(url_info)[source]

Return whether the document is likely to be CSS.

iter_text(file, encoding=None)[source]
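
A small sketch of extracting links from an in-memory stylesheet; since links are returned as raw text, the exact yielded values are an assumption:

import io

from wpull.document.css import CSSReader

css = b"body { background: url('bg.png'); }"
reader = CSSReader()

for text, is_link in reader.iter_text(io.BytesIO(css), encoding='utf-8'):
    if is_link:
        print(text)  # raw link text such as bg.png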

document.html Module

HTML document readers.

wpull.document.html.COMMENT = <object object>

Comment element

class wpull.document.html.HTMLLightParserTarget(callback, text_elements=frozenset({'style', 'script', 'link', 'url', 'icon'}))[source]

Bases: object

An HTML parser target for partial elements.

Parameters:
  • callback

    A callback function. The function should accept these arguments:

      1. tag (str): The tag name of the element.
      2. attrib (dict): The attributes of the element.
      3. text (str, None): The text of the element.
  • text_elements – A frozenset of element tag names whose text should be tracked.
close()[source]
data(data)[source]
end(tag)[source]
start(tag, attrib)[source]
class wpull.document.html.HTMLParserTarget(callback)[source]

Bases: object

An HTML parser target.

Parameters:callback

A callback function. The function should accept these arguments:

  1. tag (str): The tag name of the element.
  2. attrib (dict): The attributes of the element.
  3. text (str, None): The text of the element.
  4. tail (str, None): The text after the element.
  5. end (bool): Whether the tag is an end tag.
close()[source]
comment(text)[source]
data(data)[source]
end(tag)[source]
start(tag, attrib)[source]
class wpull.document.html.HTMLReadElement(tag, attrib, text, tail, end)[source]

Bases: object

Results from HTMLReader.read_links().

tag

str

The element tag name.

attrib

dict

The element attributes.

text

str, None

The element text.

tail

str, None

The text after the element.

end

bool

Whether the tag is an end tag.

attrib
end
tag
tail
text
class wpull.document.html.HTMLReader(html_parser)[source]

Bases: wpull.document.base.BaseDocumentDetector, wpull.document.base.BaseHTMLReader

HTML document reader.

Parameters:html_parser (document.htmlparse.BaseParser) – An HTML parser.
classmethod is_file(file)[source]

Return whether the file is likely to be HTML.

classmethod is_request(request)[source]

Return whether the Request is likely to be HTML.

classmethod is_response(response)[source]

Return whether the Response is likely to be HTML.

classmethod is_url(url_info)[source]

Return whether the URLInfo is likely to be HTML.

iter_elements(file, encoding=None)[source]
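
A sketch of reading elements from an in-memory document, assuming the lxml parser backend documented under document.htmlparse.lxml_:

import io

from wpull.document.html import HTMLReader
from wpull.document.htmlparse.lxml_ import HTMLParser

reader = HTMLReader(HTMLParser())
html = b'<html><body><a href="http://example.com/">link</a></body></html>'

for element in reader.iter_elements(io.BytesIO(html), encoding='utf-8'):
    # Items are elements from document.htmlparse.element.
    print(element)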

document.htmlparse Module

HTML parsing.

document.htmlparse.base Module

class wpull.document.htmlparse.base.BaseParser[source]

Bases: object

parse(file, encoding=None)[source]

Parse the document for elements.

Returns:Each item is from document.htmlparse.element
Return type:iterator
parser_error

Return the Exception class for parsing errors.

document.htmlparse.element Module

HTML tree things.

wpull.document.htmlparse.element.Comment

A comment.

wpull.document.htmlparse.element.text

str

The comment text.

alias of CommentType

wpull.document.htmlparse.element.Doctype

A Doctype.

wpull.document.htmlparse.element.text

str

The Doctype text.

alias of DoctypeType

wpull.document.htmlparse.element.Element

An HTML element.

Attributes:
  • tag (str): The tag name of the element.
  • attrib (dict): The attributes of the element.
  • text (str, None): The text of the element.
  • tail (str, None): The text after the element.
  • end (bool): Whether the tag is an end tag.

alias of ElementType

document.htmlparse.html5lib_ Module

Parsing using html5lib python.

class wpull.document.htmlparse.html5lib_.HTMLParser[source]

Bases: wpull.document.htmlparse.base.BaseParser

parse(file, encoding=None)[source]
parser_error

document.htmlparse.lxml_ Module

Parsing using lxml and libxml2.

class wpull.document.htmlparse.lxml_.HTMLParser[source]

Bases: wpull.document.htmlparse.base.BaseParser

HTML document parser.

This reader uses lxml as the parser.

BUFFER_SIZE = 131072
classmethod detect_parser_type(file, encoding=None)[source]

Get the suitable parser type for the document.

Returns:str
parse(file, encoding=None)[source]
classmethod parse_doctype(file, encoding=None)[source]

Get the doctype from the document.

Returns:str, None
parse_lxml(file, encoding=None, target_class=<class 'wpull.document.htmlparse.lxml_.HTMLParserTarget'>, parser_type='html')[source]

Return an iterator of elements found in the document.

Parameters:
  • file – A file object containing the document.
  • encoding (str) – The encoding of the document.
  • target_class – A class to be used for target parsing.
  • parser_type (str) – The type of parser to use. Accepted values: html, xhtml, xml.
Returns:

Each item is an element from document.htmlparse.element

Return type:

iterator

parser_error
class wpull.document.htmlparse.lxml_.HTMLParserTarget(callback)[source]

Bases: object

An HTML parser target.

Parameters:callback – A callback function. The function should accept one argument from document.htmlparse.element.
close()[source]
comment(text)[source]
data(data)[source]
end(tag)[source]
start(tag, attrib)[source]
wpull.document.htmlparse.lxml_.to_lxml_encoding(encoding)[source]

Check if lxml supports the specified encoding.

Returns:str, None

document.javascript Module

class wpull.document.javascript.JavaScriptReader[source]

Bases: wpull.document.base.BaseDocumentDetector, wpull.document.base.BaseTextStreamReader

JavaScript Document Reader.

BUFFER_SIZE = 1048576
STREAM_REWIND = 4096
URL_PATTERN = '(\\\\{0,8}[\'"])(https?://[^\'"]{1,500}|[^\\s\'"]{1,500})(?:\\1)'
URL_REGEX = re.compile('(\\\\{0,8}[\'"])(https?://[^\'"]{1,500}|[^\\s\'"]{1,500})(?:\\1)')
classmethod is_file(file)[source]

Return whether the file is likely JS.

classmethod is_request(request)[source]

Return whether the document is likely to be JS.

classmethod is_response(response)[source]

Return whether the document is likely to be JS.

classmethod is_url(url_info)[source]

Return whether the document is likely to be JS.

iter_text(file, encoding=None)[source]

Return an iterator of links found in the document.

Parameters:
  • file – A file object containing the document.
  • encoding (str) – The encoding of the document.
Returns:

str

Return type:

iterable

document.sitemap Module

Sitemap.xml

class wpull.document.sitemap.SitemapReader(html_parser)[source]

Bases: wpull.document.base.BaseDocumentDetector, wpull.document.base.BaseExtractiveReader

Sitemap XML reader.

MAX_ROBOTS_FILE_SIZE = 4096
classmethod is_file(file)[source]

Return whether the file is likely a Sitemap.

classmethod is_request(request)[source]

Return whether the document is likely to be a Sitemap.

classmethod is_response(response)[source]

Return whether the document is likely to be a Sitemap.

classmethod is_url(url_info)[source]

Return whether the document is likely to be a Sitemap.

document.util Module

Misc functions.

wpull.document.util.detect_response_encoding(response, is_html=False, peek=131072)[source]

Return the likely encoding of the response document.

Parameters:
  • response (Response) – An instance of http.Response.
  • is_html (bool) – See util.detect_encoding().
  • peek (int) – The maximum number of bytes of the document to be analyzed.
Returns:

The codec name.

Return type:

str, None

wpull.document.util.get_heading_encoding(response)[source]

Return the document encoding from an HTTP header.

Parameters:response (Response) – An instance of http.Response.
Returns:The codec name.
Return type:str, None
wpull.document.util.is_gzip(data)[source]

Return whether the data is likely to be gzip.
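
For example, only the leading bytes of the data are needed to check for the two-byte gzip magic number:

from wpull.document.util import is_gzip

assert is_gzip(b'\x1f\x8b\x08\x00')  # gzip magic number present
assert not is_gzip(b'<html>')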

document.xml Module

XML document.

class wpull.document.xml.XMLDetector[source]

Bases: wpull.document.base.BaseDocumentDetector

classmethod is_file(file)[source]
classmethod is_request(request)[source]
classmethod is_response(response)[source]
classmethod is_url(url_info)[source]

driver Module

Interprocess communicators.

driver.phantomjs Module

class wpull.driver.phantomjs.PhantomJSDriver(exe_path='phantomjs', extra_args=None, params=None)[source]

Bases: wpull.driver.process.Process

PhantomJS processing.

Parameters:
  • exe_path (str) – Path of the PhantomJS executable.
  • extra_args (list) – Additional arguments for PhantomJS. Most likely, you’ll want to pass proxy settings for capturing traffic.
  • params (PhantomJSDriverParams) – Parameters for controlling the processing pipeline.

This class launches a PhantomJS process that scrolls the page and saves snapshots. It can only be used once per URL.

close()[source]
start(use_atexit=True)[source]
wpull.driver.phantomjs.PhantomJSDriverParams

PhantomJS Driver parameters

wpull.driver.phantomjs.url

str

URL of page to fetch.

wpull.driver.phantomjs.snapshot_type

list

List of filenames. Accepted extensions are html, pdf, png, gif.

wpull.driver.phantomjs.wait_time

float

Time between page scrolls.

wpull.driver.phantomjs.num_scrolls

int

Maximum number of scrolls.

wpull.driver.phantomjs.smart_scroll

bool

Whether to stop scrolling if the number of requests and responses does not change.

wpull.driver.phantomjs.snapshot

bool

Whether to take snapshot files.

wpull.driver.phantomjs.viewport_size

tuple

Width and height of the page viewport.

wpull.driver.phantomjs.paper_size

tuple

Width and height of the paper size.

wpull.driver.phantomjs.event_log_filename

str

Path to save page events.

wpull.driver.phantomjs.action_log_filename

str

Path to save page action manipulation events.

wpull.driver.phantomjs.custom_headers

dict

Custom HTTP request headers.

wpull.driver.phantomjs.page_settings

dict

Page settings.

alias of PhantomJSDriverParamsType

wpull.driver.phantomjs.get_version(exe_path='phantomjs')[source]

Get the version string of PhantomJS.

driver.process Module

RPC processes.

class wpull.driver.process.Process(proc_args, stdout_callback=None, stderr_callback=None)[source]

Bases: object

Subprocess wrapper.

close()[source]

Terminate or kill the subprocess.

This function is blocking.

process

Return the underlying process.

start(use_atexit=True)[source]

Start the executable.

Parameters:use_atexit (bool) – If True, the process will automatically be terminated at exit.

errors Module

Exceptions.

exception wpull.errors.AuthenticationError[source]

Bases: wpull.errors.ServerError

Username or password error.

exception wpull.errors.ConnectionRefused[source]

Bases: wpull.errors.NetworkError

Server was online, but nothing was being served.

exception wpull.errors.DNSNotFound[source]

Bases: wpull.errors.NetworkError

Server’s IP address could not be located.

wpull.errors.ERROR_PRIORITIES = (<class 'wpull.errors.ServerError'>, <class 'wpull.errors.ProtocolError'>, <class 'wpull.errors.SSLVerificationError'>, <class 'wpull.errors.AuthenticationError'>, <class 'wpull.errors.DNSNotFound'>, <class 'wpull.errors.ConnectionRefused'>, <class 'wpull.errors.NetworkError'>, <class 'OSError'>, <class 'OSError'>, <class 'ValueError'>)

List of error classes ordered from least severe to most severe.

class wpull.errors.ExitStatus[source]

Bases: object

Program exit status codes.

generic_error

1

An unclassified serious or fatal error occurred.

parser_error

2

A local document or configuration file could not be parsed.

file_io_error

3

A problem with reading/writing a file occurred.

network_failure

4

A problem with the network occurred such as a DNS resolver error or a connection was refused.

ssl_verification_error

5

A server’s SSL/TLS certificate was invalid.

authentication_failure

6

A problem with a username or password.

protocol_error

7

A problem with communicating with a server occurred.

server_error

8

The server had problems fulfilling our requests.

authentication_failure = 6
file_io_error = 3
generic_error = 1
network_failure = 4
parser_error = 2
protocol_error = 7
server_error = 8
ssl_verification_error = 5
exception wpull.errors.NetworkError[source]

Bases: OSError

A networking error.

exception wpull.errors.NetworkTimedOut[source]

Bases: wpull.errors.NetworkError

Connection read/write timed out.

exception wpull.errors.ProtocolError[source]

Bases: ValueError

A protocol was not followed.

wpull.errors.SSLVerficationError

alias of SSLVerificationError

exception wpull.errors.SSLVerificationError[source]

Bases: OSError

A problem occurred validating SSL certificates.

exception wpull.errors.ServerError[source]

Bases: ValueError

Server issued an error.

namevalue Module

Key-value pairs.

class wpull.namevalue.NameValueRecord(normalize_overrides=None, encoding='utf-8', wrap_width=None)[source]

Bases: collections.abc.MutableMapping

An ordered mapping of name-value pairs.

Duplicated names are accepted.

add(name, value)[source]

Append the name-value pair to the record.

get_all()[source]

Return an iterator of name-value pairs.

get_list(name)[source]

Return all the values for the given name.

parse(string, strict=True)[source]

Parse the string or bytes.

Parameters:strict (bool) – If True, errors will not be ignored
Raises:ValueError if the record is malformed.
to_bytes(errors='strict')[source]

Convert to bytes.

to_str()[source]

Convert to string.
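
A short sketch of the mapping with duplicate names, as used for HTTP-style header fields:

from wpull.namevalue import NameValueRecord

record = NameValueRecord()
record.add('Set-Cookie', 'a=1')
record.add('Set-Cookie', 'b=2')
record['Content-Type'] = 'text/html'

print(record.get_list('Set-Cookie'))  # ['a=1', 'b=2']
print(record.to_str())  # serialized name-value lines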

wpull.namevalue.guess_line_ending(string)[source]

Return the most likely line delimiter from the string.

wpull.namevalue.normalize_name(name, overrides=None)[source]

Normalize the key name to title case.

For example, normalize_name('content-id') will become Content-Id

Parameters:
  • name (str) – The name to normalize.
  • overrides (set, sequence) – A set or sequence containing keys that should be cased to themselves. For example, passing {'WARC-Type'} will normalize any key named “warc-type” to WARC-Type instead of the default Warc-Type.
Returns:

str
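
For example, following the description above:

from wpull.namevalue import normalize_name

print(normalize_name('content-id'))  # Content-Id
print(normalize_name('warc-type', overrides={'WARC-Type'}))  # WARC-Type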

wpull.namevalue.unfold_lines(string)[source]

Join lines that are wrapped.

Any line that starts with a space or tab is joined to the previous line.

network Module

network.bandwidth Module

Network bandwidth.

class wpull.network.bandwidth.BandwidthLimiter(rate_limit)[source]

Bases: wpull.network.bandwidth.BandwidthMeter

Bandwidth rate limit calculator.

sleep_time()[source]
class wpull.network.bandwidth.BandwidthMeter(sample_size=20, sample_min_time=0.15, stall_time=5.0)[source]

Bases: object

Calculates the speed of data transfer.

Parameters:
  • sample_size (int) – The number of samples for measuring the speed.
  • sample_min_time (float) – The minimum duration between samples in seconds.
  • stall_time (float) – The time in seconds without traffic before the connection is considered stalled.
bytes_transferred

Return the number of bytes transferred

Returns:int
feed(data_len, feed_time=None)[source]

Update the bandwidth meter.

Parameters:
  • data_len (int) – The number of bytes transferred since the last call to feed().
  • feed_time (float) – Current time.
num_samples

Return the number of samples collected.

speed()[source]

Return the current transfer speed.

Returns:The speed in bytes per second.
Return type:int
stalled

Return whether the connection is stalled.

Returns:bool
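
A sketch of feeding the meter; timestamps are passed explicitly here so the example is deterministic:

from wpull.network.bandwidth import BandwidthMeter

meter = BandwidthMeter()
meter.feed(1000, feed_time=1.0)
meter.feed(1000, feed_time=2.0)  # 1000 bytes over 1 second

print(meter.speed())  # transfer speed in bytes per second
print(meter.bytes_transferred)  # total bytes fed so far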

network.connection Module

Network connections.

class wpull.network.connection.BaseConnection(address: tuple, hostname: typing.Union=None, timeout: typing.Union=None, connect_timeout: typing.Union=None, bind_host: typing.Union=None, sock: typing.Union=None)[source]

Bases: object

Base network stream.

Parameters:
  • address – 2-item tuple containing the IP address and port or 4-item for IPv6.
  • hostname – Hostname of the address (for SSL).
  • timeout – Time in seconds before a read/write operation times out.
  • connect_timeout – Time in seconds before a connect operation times out.
  • bind_host – Host name for binding the socket interface.
  • sock – Use the given socket. The socket must already be connected.
reader

Stream Reader instance.

writer

Stream Writer instance.

address

2-item tuple containing the IP address.

host

Host name.

port

Port number.

address
close()[source]

Close the connection.

closed() → bool[source]

Return whether the connection is closed.

connect()[source]

Establish a connection.

host
hostname
port
read(amount: int=-1) → bytes[source]

Read data.

readline() → bytes[source]

Read a line of data.

reset()[source]

Prepare connection for reuse.

run_network_operation(task, wait_timeout=None, close_timeout=None, name='Network operation')[source]

Run the task and raise appropriate exceptions.

Coroutine.

state() → wpull.network.connection.ConnectionState[source]

Return the state of this connection.

write(data: bytes, drain: bool=True)[source]

Write data.
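
A minimal sketch of driving a connection by hand, using the Connection subclass documented below; the I/O methods are coroutines, so they run inside an asyncio event loop, and the address is a placeholder:

import asyncio

from wpull.network.connection import Connection

@asyncio.coroutine
def fetch():
    # address is a 2-item (IP address, port) tuple; hostname is for SSL.
    connection = Connection(('10.0.0.1', 80), hostname='example.com')
    yield from connection.connect()
    yield from connection.write(b'HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n')
    line = yield from connection.readline()
    print(line)
    connection.close()

asyncio.get_event_loop().run_until_complete(fetch())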

class wpull.network.connection.CloseTimer(timeout, connection)[source]

Bases: object

Periodic timer to close connections if stalled.

close()[source]

Stop running timers.

is_timeout() → bool[source]

Return whether the timer has timed out.

with_timeout()[source]

Context manager that applies timeout checks.

class wpull.network.connection.Connection(*args, bandwidth_limiter=None, **kwargs)[source]

Bases: wpull.network.connection.BaseConnection

Network stream.

Parameters:bandwidth_limiter (bandwidth.BandwidthLimiter) – Bandwidth limiter for connection speed limiting.
key

Value used by the ConnectionPool for its host pool map. Internal use only.

wrapped_connection

A wrapped connection for ConnectionPool. Internal use only.

is_ssl

bool

Whether connection is SSL.

proxied

bool

Whether the connection is to an HTTP proxy.

tunneled

bool

Whether the connection has been tunneled with the CONNECT request.

is_ssl
proxied
read(amount: int=-1) → bytes[source]
start_tls(ssl_context: typing.Union=True) → 'SSLConnection'[source]

Start client TLS on this connection and return SSLConnection.

Coroutine

tunneled
class wpull.network.connection.ConnectionState[source]

Bases: enum.Enum

State of a connection

ready

Connection is ready to be used

created

connect has been called successfully

dead

Connection is closed

class wpull.network.connection.DummyCloseTimer[source]

Bases: object

Dummy close timer.

close()[source]
is_timeout()[source]
with_timeout()[source]
class wpull.network.connection.SSLConnection(*args, ssl_context: typing.Union=True, **kwargs)[source]

Bases: wpull.network.connection.Connection

SSL network stream.

Parameters:ssl_context – SSLContext
connect()[source]
is_ssl

network.dns Module

DNS resolution.

wpull.network.dns.AddressInfo

Socket address.

alias of _AddressInfo

class wpull.network.dns.DNSInfo[source]

Bases: wpull.network.dns._DNSInfo

DNS resource records.

to_text_format()[source]

Format the DNS information as text in detached format.

class wpull.network.dns.IPFamilyPreference[source]

Bases: enum.Enum

IPv4 and IPv6 preferences.

class wpull.network.dns.ResolveResult(address_infos: typing.List, dns_infos: typing.Union=None)[source]

Bases: object

DNS resolution information.

addresses

The socket addresses.

dns_infos

The DNS resource records.

first_ipv4

The first IPv4 address.

first_ipv6

The first IPv6 address.

rotate()[source]

Move the first address to the last position.

shuffle()[source]

Shuffle the addresses.

class wpull.network.dns.Resolver(family: wpull.network.dns.IPFamilyPreference=<IPFamilyPreference.any: 'any'>, timeout: typing.Union=None, bind_address: typing.Union=None, cache: typing.Union=None, rotate: bool=False)[source]

Bases: wpull.application.hook.HookableMixin

Asynchronous resolver with cache and timeout.

Parameters:
  • family – IPv4 or IPv6 preference.
  • timeout – A time in seconds used for timing-out requests. If not specified, this class relies on the underlying libraries.
  • bind_address – An IP address to bind DNS requests if possible.
  • cache – Cache to store results of any query.
  • rotate – If the result is cached, rotate the results; otherwise, shuffle the results.
classmethod new_cache() → wpull.cache.FIFOCache[source]

Return a default cache

resolve(host: str) → wpull.network.dns.ResolveResult[source]

Resolve hostname.

Parameters:

host – Hostname.

Returns:

Resolved IP addresses.

Raises:
  • DNSNotFound if the hostname could not be resolved or
  • NetworkError if there was an error connecting to DNS servers.

Coroutine.

static resolve_dns(host: str) → str[source]

Resolve the hostname to an IP address.

Parameters:host – The hostname.

This callback is to override the DNS lookup.

It is useful when the server is no longer available to the public. Typically, large infrastructures will change the DNS settings to make clients no longer hit the front-ends, but rather go towards a static HTTP server with a “We’ve been acqui-hired!” page. In these cases, the original servers may still be online.

Returns:None to use the original behavior or a string containing an IP address or an alternate hostname.
Return type:str, None
static resolve_dns_result(host: str, result: wpull.network.dns.ResolveResult)[source]

Callback when a DNS resolution has been made.
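
A sketch of a one-off lookup; resolve() is a coroutine, so the event loop drives it here:

import asyncio

from wpull.network.dns import Resolver

resolver = Resolver()
result = asyncio.get_event_loop().run_until_complete(
    resolver.resolve('example.com'))

print(result.first_ipv4)  # address info for the first IPv4 address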

network.pool Module

class wpull.network.pool.ConnectionPool(max_host_count: int=6, resolver: typing.Union=None, connection_factory: typing.Union=None, ssl_connection_factory: typing.Union=None, max_count: int=100)[source]

Bases: object

Connection pool.

Parameters:
  • max_host_count – Number of connections per host.
  • resolver – DNS resolver.
  • connection_factory – A function that accepts address and hostname arguments and returns a Connection instance.
  • ssl_connection_factory – A function that returns a SSLConnection instance. See connection_factory.
  • max_count – Limit on number of connections
acquire(host: str, port: int, use_ssl: bool=False, host_key: typing.Any=None) → wpull.network.connection.Connection[source]

Return an available connection.

Parameters:
  • host – A hostname or IP address.
  • port – Port number.
  • use_ssl – Whether to return a SSL connection.
  • host_key – If provided, it overrides the key used for per-host connection pooling. This is useful for proxies for example.

Coroutine.

clean(force: bool=False)[source]

Clean all closed connections.

Parameters:force – Clean connected and idle connections too.

Coroutine.

close()[source]

Close all the connections and clean up.

This instance will not be usable after calling this method.

count() → int[source]

Return number of connections.

host_pools
no_wait_release(connection: wpull.network.connection.Connection)[source]

Synchronous version of release().

release(connection: wpull.network.connection.Connection)[source]

Put a connection back in the pool.

Coroutine.

session(host: str, port: int, use_ssl: bool=False)[source]

Return a context manager that returns a connection.

Usage:

session = yield from connection_pool.session('example.com', 80)
with session as connection:
    connection.write(b'blah')
    connection.close()

Coroutine.

class wpull.network.pool.HappyEyeballsConnection(address, connection_factory, resolver, happy_eyeballs_table, is_ssl=False)[source]

Bases: object

Wrapper for a Happy Eyeballs connection.

close()[source]
closed()[source]
connect()[source]
reset()[source]
class wpull.network.pool.HappyEyeballsTable(max_items=100, time_to_live=600)[source]

Bases: object

get_preferred(addr_1, addr_2)[source]

Return the preferred address.

set_preferred(preferred_addr, addr_1, addr_2)[source]

Set the preferred address.

class wpull.network.pool.HostPool(connection_factory: typing.Callable, max_connections: int=6)[source]

Bases: object

Connection pool for a host.

ready

Queue

Connections not in use.

busy

set

Connections in use.

acquire() → wpull.network.connection.Connection[source]

Register and return a connection.

Coroutine.

clean(force: bool=False)[source]

Clean closed connections.

Parameters:force – Clean connected and idle connections too.

Coroutine.

close()[source]

Forcibly close all connections.

This instance will not be usable after calling this method.

count() → int[source]

Return total number of connections.

empty() → bool[source]

Return whether the pool is empty.

release(connection: wpull.network.connection.Connection, reuse: bool=True)[source]

Unregister a connection.

Parameters:
  • connection – Connection instance returned from acquire().
  • reuse – If True, the connection is made available for reuse.

Coroutine.

observer Module

Observer.

class wpull.observer.Observer(*handlers)[source]

Bases: object

Observer.

Parameters:handlers – Callback functions.
add(handler)[source]

Register a callback function.

clear()[source]

Remove all callback handlers.

count()[source]

Return the number of registered handlers.

notify(*args, **kwargs)[source]

Call all the callback handlers with given arguments.

remove(handler)[source]

Unregister a callback function.
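
A small sketch of registering and firing handlers:

from wpull.observer import Observer

def handler(message):
    print('Got:', message)

observer = Observer()
observer.add(handler)
observer.notify('hello')  # calls every registered handler
observer.remove(handler)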

path Module

File names and paths.

class wpull.path.BasePathNamer[source]

Bases: object

Base class for path namers.

get_filename(url_info)[source]

Return the appropriate filename based on the given URLInfo.

class wpull.path.PathNamer(root, index='index.html', use_dir=False, cut=None, protocol=False, hostname=False, os_type='unix', no_control=True, ascii_only=True, case=None, max_filename_length=None)[source]

Bases: wpull.path.BasePathNamer

Path namer that creates a directory hierarchy based on the URL.

Parameters:
  • root (str) – The base path.
  • index (str) – The filename to use when the URL path does not indicate one.
  • use_dir (bool) – Include directories based on the URL path.
  • cut (int) – Number of leading directories to cut from the file path.
  • protocol (bool) – Include the URL scheme in the directory structure.
  • hostname (bool) – Include the hostname in the directory structure.
  • safe_filename_args (dict) – Keyword arguments for safe_filename.

See also: url_to_filename(), url_to_dir_path(), safe_filename().

get_filename(url_info)[source]
safe_filename(part)[source]

Return a safe filename or file part.

class wpull.path.PercentEncoder(unix=False, control=False, windows=False, ascii_=False)[source]

Bases: collections.defaultdict

Percent encoder.

quote(bytes_string)[source]
wpull.path.anti_clobber_dir_path(dir_path, suffix='.d')[source]

Return a directory path free of filenames.

Parameters:
  • dir_path (str) – A directory path.
  • suffix (str) – The suffix to append to the part of the path that is a file.
Returns:

str

wpull.path.parse_content_disposition(text)[source]

Parse a Content-Disposition header value.

wpull.path.safe_filename(filename, os_type='unix', no_control=True, ascii_only=True, case=None, encoding='utf8', max_length=None)[source]

Return a safe filename or path part.

Parameters:
  • filename (str) – The filename or path component.
  • os_type (str) – If unix, escape the slash. If windows, escape extra Windows characters.
  • no_control (bool) – If True, escape control characters.
  • ascii_only (bool) – If True, escape non-ASCII characters.
  • case (str) – If lower, lowercase the string. If upper, uppercase the string.
  • encoding (str) – The character encoding.
  • max_length (int) – The maximum length of the filename.

This function assumes that filename has not already been percent-encoded.

Returns:str
wpull.path.url_to_dir_parts(url, include_protocol=False, include_hostname=False, alt_char=False)[source]

Return a list of directory parts from a URL.

Parameters:
  • url (str) – The URL.
  • include_protocol (bool) – If True, the scheme from the URL will be included.
  • include_hostname (bool) – If True, the hostname from the URL will be included.
  • alt_char (bool) – If True, the character for the port delimiter will be + instead of :.

This function does not include the filename and the paths are not sanitized.

Returns:list
wpull.path.url_to_filename(url, index='index.html', alt_char=False)[source]

Return a filename from a URL.

Parameters:
  • url (str) – The URL.
  • index (str) – If a filename could not be derived from the URL path, use index instead. For example, /images/ will return index.html.
  • alt_char (bool) – If True, the character for the query delimiter will be @ instead of ?.

This function does not include the directories and does not sanitize the filename.

Returns:str
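
A sketch of the helpers above; the exact escaping performed by safe_filename() depends on the options, so the comments only restate the documented behavior:

from wpull.path import safe_filename, url_to_dir_parts, url_to_filename

# A trailing-slash path falls back to the index filename.
print(url_to_filename('http://example.com/images/'))  # index.html

# Directory parts, optionally including the hostname and scheme.
print(url_to_dir_parts('http://example.com/images/photo.jpg',
                       include_hostname=True))

# Escape control and non-ASCII characters for the local filesystem.
print(safe_filename('résumé.html'))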

pipeline Module

pipeline.app Module

class wpull.pipeline.app.AppSession(factory: wpull.application.factory.Factory, args, stderr)[source]

Bases: object

class wpull.pipeline.app.AppSource(session: wpull.pipeline.app.AppSession)[source]

Bases: wpull.pipeline.pipeline.ItemSource

get_item() → typing.Union[source]
wpull.pipeline.app.new_encoded_stream(args, stream)[source]

Return a stream writer.

pipeline.item Module

URL items.

class wpull.pipeline.item.LinkType[source]

Bases: enum.Enum

The type of contents that a link is expected to have.

css = None

Stylesheet file. Recursion on links is usually safe.

directory = None

FTP directory.

file = None

FTP File.

html = None

HTML document.

javascript = None

JavaScript file. It is possible to recurse links on this file.

media = None

Image or video file. Recursion on this type will not be useful.

sitemap = None

A Sitemap.xml file.

class wpull.pipeline.item.Status[source]

Bases: enum.Enum

URL status.

done = None

The item has been processed successfully.

error = None

The item encountered an error during processing.

in_progress = None

The item is in progress of being processed.

skipped = None

The item was excluded from processing due to some rejection filters.

todo = None

The item has not yet been processed.

class wpull.pipeline.item.URLData[source]

Bases: wpull.pipeline.item.URLDatabaseMixin

Data associated with fetching the URL.

post_data (str): If given, the URL should be fetched as a POST request containing post_data.
database_attributes = ('post_data',)
class wpull.pipeline.item.URLDatabaseMixin[source]

Bases: object

database_items()[source]
class wpull.pipeline.item.URLProperties[source]

Bases: wpull.pipeline.item.URLDatabaseMixin

URL properties that determine whether a URL is fetched.

parent_url

str

The parent or referral URL that linked to this URL.

root_url

str

The earliest ancestor URL of this URL. This URL is typically the URL supplied at the start of the program.

status

Status

Processing status of this URL.

try_count

int

The number of attempts on this URL.

level

int

The recursive depth of this URL. A level of 0 indicates the URL was initially supplied to the program (the top URL). Level 1 means the URL was linked from the top URL.

inline_level

int

The depth at which this URL is an embedded object (such as an image or a stylesheet) of the parent URL.

The value represents the recursive depth of the object. For example, an iframe is depth 1 and the images in the iframe are depth 2.

link_type

LinkType

Describes the expected document type.

database_attributes = ('parent_url', 'root_url', 'status', 'try_count', 'level', 'inline_level', 'link_type', 'priority')
parent_url_info

Return URL Info for the parent URL

root_url_info

Return URL Info for the root URL

class wpull.pipeline.item.URLRecord[source]

Bases: wpull.pipeline.item.URLProperties, wpull.pipeline.item.URLData, wpull.pipeline.item.URLResult

An entry in the URL table describing a URL to be downloaded.

url

str

The URL.

url_info

Return URL Info for this URL

class wpull.pipeline.item.URLResult[source]

Bases: wpull.pipeline.item.URLDatabaseMixin

Data associated with the fetched URL.

status_code (int): The HTTP or FTP status code.

filename (str): The path to where the file was saved.

database_attributes = ('status_code', 'filename')

pipeline.pipeline Module

class wpull.pipeline.pipeline.ItemQueue[source]

Bases: typing.Generic

get() → typing.WorkItemT[source]
item_done()[source]
put_item(item: typing.WorkItemT)[source]
put_poison_nowait()[source]
unfinished_items
wait_for_worker()[source]
class wpull.pipeline.pipeline.ItemSource[source]

Bases: typing.Generic

get_item() → typing.Union[source]
class wpull.pipeline.pipeline.ItemTask[source]

Bases: typing.Generic

process(work_item: typing.WorkItemT)[source]
class wpull.pipeline.pipeline.Pipeline(item_source: wpull.pipeline.pipeline.ItemSource, tasks: typing.Sequence, item_queue: typing.Union=None)[source]

Bases: object

concurrency
process()[source]
stop()[source]
tasks
class wpull.pipeline.pipeline.PipelineSeries(pipelines: typing.Iterator)[source]

Bases: object

concurrency
concurrency_pipelines
pipelines
class wpull.pipeline.pipeline.PipelineState[source]

Bases: enum.Enum

class wpull.pipeline.pipeline.Producer(item_source: wpull.pipeline.pipeline.ItemSource, item_queue: wpull.pipeline.pipeline.ItemQueue)[source]

Bases: object

process()[source]
process_one()[source]
stop()[source]
class wpull.pipeline.pipeline.Worker(item_queue: wpull.pipeline.pipeline.ItemQueue, tasks: typing.Sequence)[source]

Bases: object

process()[source]
process_one(_worker_id=None)[source]

pipeline.progress Module

class wpull.pipeline.progress.BarProgress(*args, draw_interval: float=0.5, bar_width: int=25, human_format: bool=True, **kwargs)[source]

Bases: wpull.pipeline.progress.ProgressPrinter

update()[source]
class wpull.pipeline.progress.DotProgress(*args, draw_interval: float=2.0, **kwargs)[source]

Bases: wpull.pipeline.progress.ProgressPrinter

update()[source]
class wpull.pipeline.progress.Measurement[source]

Bases: enum.Enum

class wpull.pipeline.progress.Progress(stream: typing.IO=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='ANSI_X3.4-1968'>)[source]

Bases: wpull.application.hook.HookableMixin

Print file download progress as dots or a bar.

Parameters:
  • bar_style (bool) – If True, print as a progress bar. If False, print dots every few seconds.
  • stream – A file object. Default is usually stderr.
  • human_format (bool) – If True, format sizes in units. Otherwise, output bits only.
reset()[source]
update()[source]
class wpull.pipeline.progress.ProgressPrinter(*args, **kwargs)[source]

Bases: wpull.pipeline.progress.ProtocolProgress

update_from_end_response(response: wpull.protocol.abstract.request.BaseResponse)[source]
class wpull.pipeline.progress.ProtocolProgress(*args, **kwargs)[source]

Bases: wpull.pipeline.progress.Progress

class State[source]

Bases: enum.Enum

ProtocolProgress.update_from_begin_request(request: wpull.protocol.abstract.request.BaseRequest)[source]
ProtocolProgress.update_from_begin_response(response: wpull.protocol.abstract.request.BaseResponse)[source]
ProtocolProgress.update_from_end_response(response: wpull.protocol.abstract.request.BaseResponse)[source]
ProtocolProgress.update_with_data(data)[source]

pipeline.session Module

class wpull.pipeline.session.ItemSession(app_session: wpull.pipeline.app.AppSession, url_record: wpull.pipeline.item.URLRecord)[source]

Bases: object

Item for a URL that needs to be processed.

add_child_url(url: str, inline: bool=False, link_type: typing.Union=None, post_data: typing.Union=None, level: typing.Union=None, replace: bool=False)[source]

Add links scraped from the document with automatic values.

Parameters:
  • url – A full URL. (It can’t be a relative path.)
  • inline – Whether the URL is an embedded object.
  • link_type – Expected link type.
  • post_data – URL encoded form data. The request will be made using POST. (Don’t use this to upload files.)
  • level – The child depth of this URL.
  • replace – Whether to replace the existing entry in the database table so it will be downloaded again.

This function provides values automatically for:

  • inline
  • level
  • parent: The referring page.
  • root

See also add_url().
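
A hedged sketch of how a scripting hook might queue extra URLs; the function name is illustrative and would be attached through a plugin callback such as ProcessingRule.plugin_get_urls() documented under processor.rule:

def get_urls(item_session):
    # Queue a page discovered by custom logic; parent and root values
    # are filled in automatically.
    item_session.add_child_url('http://example.com/extra.html')

    # Queue an embedded object (a page requisite).
    item_session.add_child_url('http://example.com/logo.png', inline=True)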

add_url(url: str, url_properites: typing.Union=None, url_data: typing.Union=None)[source]
child_url_record(url: str, inline: bool=False, link_type: typing.Union=None, post_data: typing.Union=None, level: typing.Union=None)[source]

Return a child URLRecord.

This function is useful for testing filters before adding to the table.

finish()[source]
is_processed

Return whether the item has been processed.

is_virtual
request
response
set_status(status: wpull.pipeline.item.Status, increment_try_count: bool=True, filename: str=None)[source]

Mark the item with the given status.

Parameters:
  • status – a value from Status.
  • increment_try_count – if True, increment the try_count value
skip()[source]

Mark the item as processed without download.

update_record_value(**kwargs)[source]
class wpull.pipeline.session.URLItemSource(app_session: wpull.pipeline.app.AppSession)[source]

Bases: wpull.pipeline.pipeline.ItemSource

get_item() → typing.Union[source]

processor Module

Item processing.

processor.base Module

Base classes for processors.

class wpull.processor.base.BaseProcessor[source]

Bases: object

Base class for processors.

Processors contain the logic for processing requests.

close()[source]

Run any clean up actions.

process(item_session: wpull.pipeline.session.ItemSession)[source]

Process a URL item.

Parameters:item_session – The URL item.

This function handles the logic for processing a single URL item.

It must call one of engine.URLItem.set_status() or engine.URLItem.skip().

Coroutine.

class wpull.processor.base.BaseProcessorSession[source]

Bases: object

Base class for processor sessions.

wpull.processor.base.REMOTE_ERRORS = (<class 'wpull.errors.ServerError'>, <class 'wpull.errors.ProtocolError'>, <class 'wpull.errors.SSLVerificationError'>, <class 'wpull.errors.NetworkError'>)

List of error classes that are errors that occur with a server.

processor.coprocessor Module

Additional processing not associated with Wget behavior.

processor.coprocessor.phantomjs Module

PhantomJS page loading and scrolling.

class wpull.processor.coprocessor.phantomjs.PhantomJSCoprocessor(phantomjs_driver_factory: typing.Callable, processing_rule: wpull.processor.rule.ProcessingRule, phantomjs_params: wpull.processor.coprocessor.phantomjs.PhantomJSParamsType, warc_recorder=None, root_path='.')[source]

Bases: object

PhantomJS coprocessor.

Parameters:
  • phantomjs_driver_factory – Callback function that accepts params argument and returns PhantomJSDriver
  • processing_rule – Processing rule.
  • warc_recorder – WARC recorder.
  • root_path (str) – Root directory path for temp files.
process(item_session: wpull.pipeline.session.ItemSession, request, response, file_writer_session)[source]

Process PhantomJS.

Coroutine.

class wpull.processor.coprocessor.phantomjs.PhantomJSCoprocessorSession(phantomjs_driver_factory, root_path, processing_rule, file_writer_session, request, response, item_session: wpull.pipeline.session.ItemSession, params, warc_recorder)[source]

Bases: object

PhantomJS coprocessor session.

close()[source]

Clean up.

run()[source]
exception wpull.processor.coprocessor.phantomjs.PhantomJSCrashed[source]

Bases: Exception

PhantomJS exited with non-zero code.

wpull.processor.coprocessor.phantomjs.PhantomJSParams

PhantomJS parameters

wpull.processor.coprocessor.phantomjs.snapshot_type

list

File types. Accepted are html, pdf, png, gif.

wpull.processor.coprocessor.phantomjs.wait_time

float

Time between page scrolls.

wpull.processor.coprocessor.phantomjs.num_scrolls

int

Maximum number of scrolls.

wpull.processor.coprocessor.phantomjs.smart_scroll

bool

Whether to stop scrolling if the number of requests and responses does not change.

wpull.processor.coprocessor.phantomjs.snapshot

bool

Whether to take snapshot files.

wpull.processor.coprocessor.phantomjs.viewport_size

tuple

Width and height of the page viewport.

wpull.processor.coprocessor.phantomjs.paper_size

tuple

Width and height of the paper size.

wpull.processor.coprocessor.phantomjs.load_time

float

Maximum time to wait for page load.

wpull.processor.coprocessor.phantomjs.custom_headers

dict

Default HTTP headers.

wpull.processor.coprocessor.phantomjs.page_settings

dict

Page settings.

alias of PhantomJSParamsType

processor.coprocessor.proxy Module

class wpull.processor.coprocessor.proxy.ProxyCoprocessor(app_session: wpull.pipeline.app.AppSession)[source]

Bases: object

Proxy coprocessor.

class wpull.processor.coprocessor.proxy.ProxyCoprocessorSession(app_session: wpull.pipeline.app.AppSession, http_proxy_session: wpull.proxy.server.HTTPProxySession)[source]

Bases: object

class wpull.processor.coprocessor.proxy.ProxyItemSession(app_session: wpull.pipeline.app.AppSession, url_record: wpull.pipeline.item.URLRecord)[source]

Bases: wpull.pipeline.session.ItemSession

is_virtual
skip()[source]

processor.coprocessor.youtubedl Module

class wpull.processor.coprocessor.youtubedl.Session(proxy_address, youtube_dl_path, root_path, item_session: wpull.pipeline.session.ItemSession, file_writer_session, user_agent, warc_recorder, inet_family, check_certificate)[source]

Bases: object

youtube-dl session.

close()[source]
run()[source]
class wpull.processor.coprocessor.youtubedl.YoutubeDlCoprocessor(youtube_dl_path, proxy_address, root_path='.', user_agent=None, warc_recorder=None, inet_family=False, check_certificate=True)[source]

Bases: object

youtube-dl coprocessor.

process(item_session: wpull.pipeline.session.ItemSession, request, response, file_writer_session)[source]
wpull.processor.coprocessor.youtubedl.get_version(exe_path='youtube-dl')[source]

Get the version string of youtube-dl.

processor.delegate Module

Delegation to other processor.

class wpull.processor.delegate.DelegateProcessor[source]

Bases: wpull.processor.base.BaseProcessor

Delegate to Web or FTP processor.

close()[source]
process(item_session: wpull.pipeline.session.ItemSession)[source]
register(scheme: str, processor: wpull.processor.base.BaseProcessor)[source]

processor.ftp Module

FTP

class wpull.processor.ftp.FTPProcessor(ftp_client: wpull.protocol.ftp.client.Client, fetch_params)[source]

Bases: wpull.processor.base.BaseProcessor

FTP processor.

Parameters:
  • ftp_client – The FTP client.
  • fetch_params (WebProcessorFetchParams) – Parameters for fetching.
close()[source]

Close the FTP client.

fetch_params

The fetch parameters.

ftp_client

The ftp client.

listing_cache

Listing cache.

Returns:A cache mapping from URL to list of ftp.ls.listing.FileEntry.
process(item_session: wpull.pipeline.session.ItemSession)[source]
wpull.processor.ftp.FTPProcessorFetchParams

FTPProcessorFetchParams

Parameters:
  • remove_listing (bool) – Remove .listing files after fetching.
  • glob (bool) – Enable URL globbing.
  • preserve_permissions (bool) – Preserve file permissions.
  • follow_symlinks (bool) – Follow symlinks.

alias of FTPProcessorFetchParamsType

class wpull.processor.ftp.FTPProcessorSession(processor: wpull.processor.ftp.FTPProcessor, item_session: wpull.pipeline.session.ItemSession)[source]

Bases: wpull.processor.base.BaseProcessorSession

Fetches FTP files or directory listings.

close()[source]
process()[source]

Process.

Coroutine.

exception wpull.processor.ftp.HookPreResponseBreak[source]

Bases: wpull.errors.ProtocolError

Hook pre-response break.

wpull.processor.ftp.append_slash_to_path_url(url_info: wpull.url.URLInfo) → str[source]

Return URL string with the path suffixed with a slash.

wpull.processor.ftp.to_dir_path_url(url_info: wpull.url.URLInfo) → str[source]

Return URL string with the path replaced with directory only.

processor.rule Module

Fetching rules.

class wpull.processor.rule.FetchRule(url_filter: wpull.urlfilter.DemuxURLFilter=None, robots_txt_checker: wpull.protocol.http.robots.RobotsTxtChecker=None, http_login: typing.Union=None, ftp_login: typing.Union=None, duration_timeout: typing.Union=None)[source]

Bases: wpull.application.hook.HookableMixin

Decide on what URLs should be fetched.

check_ftp_request(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple

Check URL filters and scripting hook.

Returns:(bool, str)
Return type:tuple
check_generic_request(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple[source]

Check URL filters and scripting hook.

Returns:(bool, str)
Return type:tuple
check_initial_web_request(item_session: wpull.pipeline.session.ItemSession, request: wpull.protocol.http.request.Request) → typing.Tuple[source]

Check robots.txt, URL filters, and scripting hook.

Returns:(bool, str)
Return type:tuple

Coroutine.

check_subsequent_web_request(item_session: wpull.pipeline.session.ItemSession, is_redirect: bool=False) → typing.Tuple[source]

Check URL filters and scripting hook.

Returns:(bool, str)
Return type:tuple
consult_filters(url_info: wpull.url.URLInfo, url_record: wpull.pipeline.item.URLRecord, is_redirect: bool=False) → typing.Tuple[source]

Consult the URL filter.

Parameters:
  • url_record – The URL record.
  • is_redirect – Whether the request is a redirect that is desired to span hosts.
Returns:

tuple:

  1. bool: The verdict
  2. str: A short reason string: nofilters, filters, redirect
  3. dict: The result from DemuxURLFilter.test_info()
consult_helix_fossil() → bool[source]

Consult the helix fossil.

Returns:True if can fetch
consult_hook(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reason: str, test_info: dict)[source]

Consult the scripting hook.

Returns:(bool, str)
Return type:tuple
consult_robots_txt(request: wpull.protocol.http.request.Request) → bool[source]

Consult by fetching robots.txt as needed.

Parameters:request – The request to be made to get the file.
Returns:True if can fetch

Coroutine

classmethod is_only_span_hosts_failed(test_info: dict) → bool[source]

Return whether only the SpanHostsFilter failed.

static plugin_accept_url(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reasons: dict) → bool[source]

Return whether to download this URL.

Parameters:
  • item_session – Current URL item.
  • verdict – A bool indicating whether Wpull wants to download the URL.
  • reasons

    A dict containing information for the verdict:

    • filters (dict): A mapping (str to bool) from filter name to whether the filter passed or not.
    • reason (str): A short reason string. Current values are: filters, robots, redirect.
Returns:

If True, the URL should be downloaded. Otherwise, the URL is skipped.
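
A hedged sketch of a callback with this signature; it assumes the item session exposes its URL record and vetoes URLs whose path mentions logout:

def accept_url(item_session, verdict, reasons):
    if 'logout' in item_session.url_record.url:
        return False  # skip this URL
    return verdict  # otherwise keep Wpull's verdict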

class wpull.processor.rule.ProcessingRule(fetch_rule: wpull.processor.rule.FetchRule, document_scraper: wpull.scraper.base.DemuxDocumentScraper=None, sitemaps: bool=False, url_rewriter: wpull.urlrewrite.URLRewriter=None)[source]

Bases: wpull.application.hook.HookableMixin

Document processing rules.

Parameters:
  • fetch_rule – The FetchRule instance.
  • document_scraper – The document scraper.
add_extra_urls(item_session: wpull.pipeline.session.ItemSession)[source]

Add additional URLs such as robots.txt, favicon.ico.

static parse_url(url, encoding='utf-8')

Parse and return a URLInfo.

This function logs a warning if the URL cannot be parsed and returns None.

static plugin_get_urls(item_session: wpull.pipeline.session.ItemSession)[source]

Add additional URLs to be added to the URL Table.

When this event is dispatched, the caller should add any URLs needed using ItemSession.add_child_url().

rewrite_url(url_info: wpull.url.URLInfo) → wpull.url.URLInfo[source]

Return a rewritten URL, such as an escaped fragment.

scrape_document(item_session: wpull.pipeline.session.ItemSession)[source]

Process document for links.

class wpull.processor.rule.ResultRule(ssl_verification: bool=False, retry_connrefused: bool=False, retry_dns_error: bool=False, waiter: typing.Union=None, statistics: typing.Union=None)[source]

Bases: wpull.application.hook.HookableMixin

Decide on the results of a fetch.

Parameters:
  • ssl_verification – If True, don’t ignore certificate errors.
  • retry_connrefused – If True, don’t consider a connection refused error to be a permanent error.
  • retry_dns_error – If True, don’t consider a DNS resolution error to be permanent error.
  • waiter – The Waiter.
  • statistics – The Statistics.
consult_error_hook(item_session: wpull.pipeline.session.ItemSession, error: BaseException)[source]

Return scripting action when an error occurred.

consult_pre_response_hook(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Return scripting action when a response begins.

consult_response_hook(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Return scripting action when a response ends.

get_wait_time(item_session: wpull.pipeline.session.ItemSession, error=None)[source]

Return the wait time in seconds between requests.

handle_document(item_session: wpull.pipeline.session.ItemSession, filename: str) → wpull.application.hook.Actions[source]

Process a successful document response.

Returns:A value from hook.Actions.
handle_document_error(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Callback for when the document only describes a server error.

Returns:A value from hook.Actions.
handle_error(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions[source]

Process an error.

Returns:A value from hook.Actions.
handle_intermediate_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Callback for successful intermediate responses.

Returns:A value from hook.Actions.
handle_no_document(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Callback for successful responses containing no useful document.

Returns:A value from hook.Actions.
handle_pre_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Process a response that is starting.

handle_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Generic handler for a response.

Returns:A value from hook.Actions.
static plugin_handle_error(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions[source]

Return an action to handle the error.

Parameters:
  • item_session
  • error
Returns:

A value from Actions. The default is Actions.NORMAL.

static plugin_handle_pre_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Return an action to handle a response status before a download.

Parameters:item_session
Returns:A value from Actions. The default is Actions.NORMAL.
static plugin_handle_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]

Return an action to handle the response.

Parameters:item_session
Returns:A value from Actions. The default is Actions.NORMAL.
static plugin_wait_time(seconds: float, item_session: wpull.pipeline.session.ItemSession, error: typing.Union=None) → float[source]

Return the wait time between requests.

Parameters:
  • seconds – The original time in seconds.
  • item_session
  • error
Returns:

The time in seconds.

processor.web Module

Web processing.

exception wpull.processor.web.HookPreResponseBreak[source]

Bases: wpull.errors.ProtocolError

Hook pre-response break.

class wpull.processor.web.WebProcessor(web_client: wpull.protocol.http.web.WebClient, fetch_params: wpull.processor.web.WebProcessorFetchParamsType)[source]

Bases: wpull.processor.base.BaseProcessor, wpull.application.hook.HookableMixin

HTTP processor.

Parameters:
  • web_client – The web client.
  • fetch_params – Fetch parameters
DOCUMENT_STATUS_CODES = (200, 204, 206, 304)

Default status codes considered successfully fetching a document.

NO_DOCUMENT_STATUS_CODES = (401, 403, 404, 405, 410)

Default status codes considered a permanent error.

close()[source]

Close the web client.

fetch_params

The fetch parameters.

process(item_session: wpull.pipeline.session.ItemSession)[source]
web_client

The web client.

wpull.processor.web.WebProcessorFetchParams

WebProcessorFetchParams

Parameters:
  • post_data (str) – If provided, all requests will be POSTed with the given post_data. post_data must be in percent-encoded query format (“application/x-www-form-urlencoded”).
  • strong_redirects (bool) – If True, redirects are allowed to span hosts.

alias of WebProcessorFetchParamsType

class wpull.processor.web.WebProcessorSession(processor: wpull.processor.web.WebProcessor, item_session: wpull.pipeline.session.ItemSession)[source]

Bases: wpull.processor.base.BaseProcessorSession

Fetches an HTTP document.

This Processor Session will handle document redirects within the same Session. HTTP errors such as 404 are considered permanent errors. HTTP errors like 500 are considered transient errors and are handled in subsequent sessions by marking the item as “error”.

If a successful document has been downloaded, it will be scraped for URLs to be added to the URL table. This Processor Session is very simple; it cannot handle JavaScript or Flash plugins.

close()[source]

Close any temp files.

process()[source]

protocol Module

protocol.abstract Module

Conversation abstractions.

protocol.abstract.client Module

Client abstractions

class wpull.protocol.abstract.client.BaseClient(connection_pool: typing.Union=None)[source]

Bases: typing.Generic, wpull.application.hook.HookableMixin

Base client.

class ClientEvent[source]

Bases: enum.Enum

BaseClient.close()[source]

Close the connection pool.

BaseClient.session() → typing.SessionT[source]

Return a new session.

class wpull.protocol.abstract.client.BaseSession(connection_pool)[source]

Bases: wpull.application.hook.HookableMixin

Base session.

class SessionEvent[source]

Bases: enum.Enum

BaseSession.abort()[source]

Terminate early and close any connections.

BaseSession.recycle()[source]

Clean up and return connections back to the pool.

Connections should be kept alive if supported.

exception wpull.protocol.abstract.client.DurationTimeout[source]

Bases: wpull.errors.NetworkTimedOut

Download did not complete within specified time.

wpull.protocol.abstract.client.dummy_context_manager()[source]

protocol.abstract.request Module

Request object abstractions

class wpull.protocol.abstract.request.BaseRequest[source]

Bases: wpull.protocol.abstract.request.URLPropertyMixin

set_continue(offset: int)[source]
class wpull.protocol.abstract.request.BaseResponse[source]

Bases: wpull.protocol.abstract.request.ProtocolResponseMixin

class wpull.protocol.abstract.request.DictableMixin[source]

Bases: object

classmethod call_to_dict_or_none(instance)[source]

Call to_dict or return None.

to_dict()[source]

Convert to a dict suitable for JSON.

class wpull.protocol.abstract.request.ProtocolResponseMixin[source]

Bases: object

Protocol abstraction for response objects.

protocol

Name of the protocol.

Returns:Either ftp or http.
Return type:str
response_code()[source]

Response code representative for the protocol.

Returns:The status code for HTTP or the final reply code for FTP.
Return type:int
response_message()[source]

Response message representative for the protocol.

Returns:The status line reason for HTTP or a reply message for FTP.
Return type:str
class wpull.protocol.abstract.request.SerializableMixin[source]

Bases: object

Serialize and unserialize methods.

parse(data)[source]

Parse from HTTP bytes.

to_bytes()[source]

Serialize to HTTP bytes.

class wpull.protocol.abstract.request.URLPropertyMixin[source]

Bases: object

Provide URL as a property.

url

str

The complete URL string.

url_info

url.URLInfo

The URLInfo of the url attribute.

Setting url or url_info will update the other respectively.
url
url_info

protocol.abstract.stream Module

Abstract stream classes

class wpull.protocol.abstract.stream.DataEventDispatcher[source]

Bases: object

add_read_listener(callback: typing.Callable)[source]
add_write_listener(callback: typing.Callable)[source]
notify_read(data: bytes)[source]
notify_write(data: bytes)[source]
remove_read_listener(callback: typing.Callable)[source]
remove_write_listener(callback: typing.Callable)[source]
wpull.protocol.abstract.stream.close_stream_on_error(func)[source]

Decorator to close stream on error.

protocol.ftp Module

File transfer protocol.

protocol.ftp.client Module

FTP client.

class wpull.protocol.ftp.client.Client(*args, **kwargs)[source]

Bases: wpull.protocol.abstract.client.BaseClient

FTP Client.

The session object is Session.

session() → wpull.protocol.ftp.client.Session[source]
class wpull.protocol.ftp.client.Session(login_table: weakref.WeakKeyDictionary, **kwargs)[source]

Bases: wpull.protocol.abstract.client.BaseSession

class Event[source]

Bases: enum.Enum

Session.abort()[source]
Session.download(file: typing.Union=None, rewind: bool=True, duration_timeout: typing.Union=None) → wpull.protocol.ftp.request.Response[source]

Read the response content into file.

Parameters:
  • file – A file object or asyncio stream.
  • rewind – Seek the given file back to its original offset after reading is finished.
  • duration_timeout – Maximum time in seconds of which the entire file must be read.
Returns:

A Response populated with the final data connection reply.

Be sure to call start() first.

Coroutine.

Session.download_listing(file: typing.Union, duration_timeout: typing.Union=None) → wpull.protocol.ftp.request.ListingResponse[source]

Read file listings.

Parameters:
  • file – A file object or asyncio stream.
  • duration_timeout – Maximum time in seconds of which the entire file must be read.
Returns:

A Response populated with the file listings.

Be sure to call start_file_listing() first.

Coroutine.

Session.recycle()[source]
Session.start(request: wpull.protocol.ftp.request.Request) → wpull.protocol.ftp.request.Response[source]

Start a file or directory listing download.

Parameters:request – Request.
Returns:A Response populated with the initial data connection reply.

Once the response is received, call download().

Coroutine.

Session.start_listing(request: wpull.protocol.ftp.request.Request) → wpull.protocol.ftp.request.ListingResponse[source]

Fetch a file listing.

Parameters:request – Request.
Returns:A listing response populated with the initial data connection reply.

Once the response is received, call download_listing().

Coroutine.
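
Example usage (an illustrative sketch, not from the original reference; assumes Python 3.5 async syntax, default client arguments, and a hypothetical server URL):

import asyncio

from wpull.protocol.ftp.client import Client
from wpull.protocol.ftp.request import Request

async def fetch():
    client = Client()
    session = client.session()
    try:
        # Start the transfer, then stream the content to a local file.
        response = await session.start(
            Request('ftp://example.com/pub/file.txt'))
        with open('file.txt', 'wb') as file:
            await session.download(file=file)
    finally:
        session.recycle()
        client.close()

asyncio.get_event_loop().run_until_complete(fetch())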

class wpull.protocol.ftp.client.SessionState[source]

Bases: enum.Enum

protocol.ftp.command Module

FTP service control.

class wpull.protocol.ftp.command.Commander(data_stream)[source]

Bases: object

Helper class that performs typical FTP routines.

Parameters:control_stream (ftp.stream.ControlStream) – The control stream.
begin_stream(command: wpull.protocol.ftp.request.Command) → wpull.protocol.ftp.request.Reply[source]

Start sending content on the data stream.

Parameters:command – A command that tells the server to send data over the data connection.

Coroutine.

Returns:The begin reply.
login(username: str='anonymous', password: str='-wpull-lib@')[source]

Log in.

Coroutine.

passive_mode() → typing.Tuple[source]

Enable passive mode.

Returns:The address (IP address, port) of the passive port.

Coroutine.

classmethod raise_if_not_match(action: str, expected_code: typing.Union, reply: wpull.protocol.ftp.request.Reply)[source]

Raise FTPServerError if not expected reply code.

Parameters:
  • action – Label to use in the exception message.
  • expected_code – Expected 3 digit code.
  • reply – Reply from the server.
read_stream(file: typing.IO, data_stream: wpull.protocol.ftp.stream.DataStream) → wpull.protocol.ftp.request.Reply[source]

Read from the data stream.

Parameters:
  • file – A destination file object or a stream writer.
  • data_stream – The stream of which to read from.

Coroutine.

Returns:The final reply.
Return type:Reply
read_welcome_message()[source]

Read the welcome message.

Coroutine.

restart(offset: int)[source]

Send restart command.

Coroutine.

setup_data_stream(connection_factory: typing.Callable, data_stream_factory: typing.Callable=<class 'wpull.protocol.ftp.stream.DataStream'>) → wpull.protocol.ftp.stream.DataStream[source]

Create and setup a data stream.

This function will set up passive and binary mode and handle connecting to the data connection.

Parameters:
  • connection_factory – A coroutine callback that returns a connection
  • data_stream_factory – A callback that returns a data stream

Coroutine.

Returns:DataStream
size(filename: str) → int[source]

Get size of file.

Coroutine.

protocol.ftp.ls Module

I-tried-my-best LIST parsing package.

protocol.ftp.ls.date Module

Date and time parsing

wpull.protocol.ftp.ls.date.AM_STRINGS = {'vorm', 'पूर्व', 'a. m', 'am', '午前', '上午', 'ص'}

Set of AM day period strings.

wpull.protocol.ftp.ls.date.DAY_PERIOD_PATTERN = re.compile('(nachm|vorm|م|पूर्व|a. m|午後|अपर|pm|下午|am|p. m|午前|上午|ص)\\b', re.IGNORECASE)

Regex pattern for AM/PM string.

wpull.protocol.ftp.ls.date.ISO_8601_DATE_PATTERN = re.compile('(\\d{4})(?!\\d)[\\w./-](\\d{1,2})(?!\\d)[\\w./-](\\d{1,2})')

Regex pattern for dates similar to YYYY-MM-DD.

wpull.protocol.ftp.ls.date.MMM_DD_YY_PATTERN = re.compile('([^\\W\\d_]{3,4})\\s{0,4}(\\d{1,2})\\s{0,4}(\\d{0,4})')

Regex pattern for dates similar to MMM DD YY.

Example: Feb 09 90

wpull.protocol.ftp.ls.date.MONTH_MAP = {'أبريل': 4, 'juni': 6, 'set': 9, 'lis': 11, 'juil': 7, 'lip': 7, 'aug': 8, 'sie': 8, 'जन': 1, 'يوليو': 7, 'नवं': 11, 'lut': 2, 'oct': 10, '7月': 7, 'juin': 6, 'فبراير': 2, '3月': 3, 'dec': 12, 'मार्च': 3, 'अक्टू': 10, 'sty': 1, 'जुला': 7, 'juli': 7, 'أكتوبر': 10, 'марта': 3, 'jan': 1, 'янв': 1, 'нояб': 11, 'ديسمبر': 12, 'apr': 4, 'अग': 8, 'août': 8, 'ago': 8, 'июня': 6, 'окт': 10, 'févr': 2, 'मई': 5, '8月': 8, 'ene': 1, 'сент': 9, 'نوفمبر': 11, '9月': 9, 'nov': 11, '5月': 5, '10月': 10, 'jul': 7, 'يناير': 1, 'जून': 6, 'mars': 3, 'déc': 12, 'dez': 12, 'dic': 12, 'okt': 10, 'апр': 4, 'avr': 4, 'mai': 5, 'gru': 12, '6月': 6, 'июля': 7, '12月': 12, 'wrz': 9, 'out': 10, 'авг': 8, 'फ़र': 2, 'мая': 5, 'февр': 2, 'سبتمبر': 9, 'feb': 2, 'अप्रै': 4, 'maj': 5, 'fev': 2, 'مارس': 3, '1月': 1, 'may': 5, 'mar': 3, '4月': 4, 'jun': 6, 'दिसं': 12, 'paź': 10, 'sep': 9, 'kwi': 4, '11月': 11, '2月': 2, 'abr': 4, 'सितं': 9, 'märz': 3, 'مايو': 5, 'أغسطس': 8, 'sept': 9, 'janv': 1, 'дек': 12, 'cze': 6, 'يونيو': 6}

Month names to int.

wpull.protocol.ftp.ls.date.NN_NN_NNNN_PATTERN = re.compile('(\\d{1,2})[./-](\\d{1,2})[./-](\\d{2,4})')

Regex pattern for dates similar to NN NN YYYY.

Example: 2/9/90

wpull.protocol.ftp.ls.date.PM_STRINGS = {'nachm', 'م', '午後', 'अपर', 'pm', '下午', 'p. m'}

Set of PM day period strings.

wpull.protocol.ftp.ls.date.TIME_PATTERN = re.compile('(\\d{1,2}):(\\d{2}):?(\\d{0,2})\\s?(nachm|vorm|م|पूर्व|a. m|午後|अपर|pm|下午|am|p. m|午前|上午|ص|\x08)?')

Regex pattern for time in HH:MM[:SS]

wpull.protocol.ftp.ls.date.guess_datetime_format(lines: typing.Iterable, threshold: int=5) → typing.Tuple[source]

Guess the order of the year, month, and day, and whether the time is 12- or 24-hour.

Returns:First item is either str ymd, dmy, mdy or None. Second item is either True for 12-hour time or False for 24-hour time or None.
Return type:tuple
wpull.protocol.ftp.ls.date.parse_cldr_json(directory, language_codes=('zh', 'es', 'en', 'hi', 'ar', 'pt', 'ru', 'ja', 'de', 'fr', 'pl'), massage=True)[source]

Parse CLDR JSON datasets for date and time strings.

wpull.protocol.ftp.ls.date.parse_datetime(text: str, date_format: str=None, is_day_period: typing.Union=None, datetime_now: datetime.datetime=None) → typing.Tuple[source]

Parse date/time from a line of text into datetime object.

wpull.protocol.ftp.ls.date.parse_month(text: str) → int[source]

Parse month string into integer.

wpull.protocol.ftp.ls.date.y2k(year: int) → int[source]

Convert two digit year to four digit year.
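
Example usage (an illustrative sketch; guess_datetime_format may return None values when the sample is too small for its heuristics):

from wpull.protocol.ftp.ls.date import guess_datetime_format, parse_month, y2k

lines = [
    '-rw-r--r-- 1 ftp ftp 1024 Feb 09 1990 notes.txt',
    '-rw-r--r-- 1 ftp ftp 2048 Mar 10 1991 todo.txt',
]

print(guess_datetime_format(lines))  # e.g. ('mdy', None) for this sample
print(parse_month('feb'))  # 2
print(y2k(90))  # 1990, assuming the usual century pivot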

protocol.ftp.ls.listing Module

Listing parser.

wpull.protocol.ftp.ls.listing.FileEntry

A row in a listing.

wpull.protocol.ftp.ls.listing.name

str

Filename.

wpull.protocol.ftp.ls.listing.type

str, None

file, dir, symlink, other, None

wpull.protocol.ftp.ls.listing.size

int, None

Size of file.

wpull.protocol.ftp.ls.listing.date

datetime.datetime, None

A datetime object in UTC.

wpull.protocol.ftp.ls.listing.dest

str, None

Destination filename for symlinks.

wpull.protocol.ftp.ls.listing.perm

int, None

Unix permissions expressed as an integer.

alias of FileEntryType

class wpull.protocol.ftp.ls.listing.LineParser[source]

Bases: object

Parse individual lines in a listing.

guess_type(sample_lines)[source]

Guess the type of listing from a sample of lines.

parse(lines)[source]

Parse the lines.

parse_datetime(text)[source]

Parse datetime from line of text.

parse_msdos(lines)[source]

Parse lines from an MS-DOS format.

parse_nlst(lines)[source]

Parse lines from an NLST format.

parse_unix(lines)[source]

Parse listings from a Unix ls command format.

set_datetime_format(datetime_format)[source]

Set the datetime format.

exception wpull.protocol.ftp.ls.listing.ListingError[source]

Bases: ValueError

Error during parsing a listing.

class wpull.protocol.ftp.ls.listing.ListingParser(text=None, file=None)[source]

Bases: wpull.protocol.ftp.ls.listing.LineParser

Listing parser.

Parameters:
  • text (str) – A text listing.
  • file – A file object in text mode containing the listing.
parse_input()[source]

Parse the listings.

Returns:An iterable of ftp.ls.listing.FileEntry
Return type:iter
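
Example usage (an illustrative sketch; the parser’s heuristics must recognize the sample as a Unix-style listing):

from wpull.protocol.ftp.ls.listing import ListingParser

text = (
    '-rw-r--r-- 1 ftp ftp 1024 Feb 09 2014 README.txt\n'
    'drwxr-xr-x 2 ftp ftp 4096 Feb 10 2014 pub\n'
)

parser = ListingParser(text=text)

for entry in parser.parse_input():
    print(entry.name, entry.type, entry.size, entry.date)
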
exception wpull.protocol.ftp.ls.listing.UnknownListingError[source]

Bases: wpull.protocol.ftp.ls.listing.ListingError

Failed to determine type of listing.

wpull.protocol.ftp.ls.listing.guess_listing_type(lines, threshold=100)[source]

Guess the style of directory listing.

Returns:unix, msdos, nlst, unknown.
Return type:str
wpull.protocol.ftp.ls.listing.parse_int(text)[source]

Parse an integer that may contain grouping characters.

wpull.protocol.ftp.ls.listing.parse_unix_perm(text)[source]

Parse a Unix permission string and return integer value.

protocol.ftp.request Module

FTP conversation classes

class wpull.protocol.ftp.request.Command(name=None, argument='')[source]

Bases: wpull.protocol.abstract.request.SerializableMixin, wpull.protocol.abstract.request.DictableMixin

FTP request command.

Encoding is UTF-8.

name

str

The command. Usually 4 characters or less.

argument

str

Optional argument for the command.

name
parse(data)[source]
to_bytes()[source]
to_dict()[source]
class wpull.protocol.ftp.request.ListingResponse[source]

Bases: wpull.protocol.ftp.request.Response

FTP response for a file listing.

files

list

A list of ftp.ls.listing.FileEntry

to_dict()[source]
class wpull.protocol.ftp.request.Reply(code=None, text=None)[source]

Bases: wpull.protocol.abstract.request.SerializableMixin, wpull.protocol.abstract.request.DictableMixin

FTP reply.

Encoding is always UTF-8.

code

int

Reply code.

text

str

Reply message.

code_tuple()[source]

Return a tuple of the reply code.

parse(data)[source]
to_bytes()[source]
to_dict()[source]
class wpull.protocol.ftp.request.Request(url)[source]

Bases: wpull.protocol.abstract.request.BaseRequest, wpull.protocol.abstract.request.URLPropertyMixin

FTP request for a file.

address

tuple

Address of control connection.

data_address

tuple

Address of data connection.

username

str, None

Username for login.

password

str, None

Password for login.

restart_value

int, None

Optional value for REST command.

file_path

str

Path of the file.

file_path
set_continue(offset)[source]

Modify the request into a restart request.

to_dict()[source]
class wpull.protocol.ftp.request.Response[source]

Bases: wpull.protocol.abstract.request.BaseResponse, wpull.protocol.abstract.request.DictableMixin

FTP response for a file.

request

Request

The corresponding request.

body

body.Body, file-like, None

The file.

reply

Reply

The latest Reply.

file_transfer_size

int

Size of the file transfer without considering restart. (REST is issued last.)

This will be the file size. (STREAM mode is always used.)

restart_value

int

Offset value of restarted transfer.

protocol
response_code()[source]
response_message()[source]
to_dict()[source]

protocol.ftp.stream Module

FTP Streams

class wpull.protocol.ftp.stream.ControlStream(connection: wpull.network.connection.Connection)[source]

Bases: object

Stream class for a control connection.

Parameters:connection – Connection.
close()[source]

Close the connection.

closed() → bool[source]

Return whether the connection is closed.

data_event_dispatcher
read_reply() → wpull.protocol.ftp.request.Reply[source]

Read a reply from the stream.

Returns:The reply
Return type:ftp.request.Reply

Coroutine.

reconnect()[source]

Connect the stream if needed.

Coroutine.

write_command(command: wpull.protocol.ftp.request.Command)[source]

Write a command to the stream.

Parameters:command – The command.

Coroutine.

class wpull.protocol.ftp.stream.DataStream(connection: wpull.network.connection.Connection)[source]

Bases: object

Stream class for a data connection.

Parameters:connection – Connection.
close()[source]

Close connection.

closed() → bool[source]

Return whether the connection is closed.

data_event_dispatcher
read_file(file: typing.Union=None)[source]

Read from connection to file.

Parameters:file – A file object or a writer stream.

protocol.ftp.util Module

Utils

exception wpull.protocol.ftp.util.FTPServerError[source]

Bases: wpull.errors.ServerError

reply_code

Return reply code.

class wpull.protocol.ftp.util.ReplyCodes[source]

Bases: object

bad_sequence_of_commands = 503
cant_open_data_connection = 425
closing_data_connection = 226
command_not_implemented = 502
command_not_implemented_for_that_parameter = 504
command_not_implemented_superfluous_at_this_site = 202
command_okay = 200
connection_closed_transfer_aborted = 426
data_connection_already_open_transfer_starting = 125
data_connection_open_no_transfer_in_progress = 225
directory_status = 212
entering_passive_mode = 227
file_status = 213
file_status_okay_about_to_open_data_connection = 150
help_message = 214
name_system_type = 215
need_account_for_login = 332
need_account_for_storing_files = 532
not_logged_in = 530
pathname_created = 257
requested_action_aborted_local_error_in_processing = 451
requested_action_aborted_page_type_unknown = 551
requested_action_not_taken_file_name_not_allowed = 553
requested_action_not_taken_file_unavailable = 550
requested_action_not_taken_insufficient_storage_space = 452
requested_file_action_aborted = 552
requested_file_action_not_taken = 450
requested_file_action_okay_completed = 250
requested_file_action_pending_further_information = 350
restart_marker_reply = 110
service_closing_control_connection = 221
service_not_available_closing_control_connection = 421
service_ready_for_new_user = 220
service_ready_in_nnn_minutes = 120
syntax_error_command_unrecognized = 500
syntax_error_in_parameters_or_arguments = 501
system_status_or_system_help_reply = 211
user_logged_in_proceed = 230
user_name_okay_need_password = 331
wpull.protocol.ftp.util.convert_machine_list_time_val(text: str) → datetime.datetime[source]

Convert RFC 3659 time-val to datetime objects.

wpull.protocol.ftp.util.convert_machine_list_value(name: str, value: str) → typing.Union[source]

Convert sizes and time values.

Size will be int while time value will be datetime.datetime.

wpull.protocol.ftp.util.machine_listings_to_file_entries(listings: typing.Iterable) → typing.Iterable[source]

Convert results from parsing machine listings to FileEntry list.

wpull.protocol.ftp.util.parse_address(text: str) → typing.Tuple[source]

Parse PASV address.

wpull.protocol.ftp.util.parse_machine_listing(text: str, convert: bool=True, strict: bool=True) → typing.List[source]

Parse machine listing.

Parameters:
  • text – The listing.
  • convert – Convert sizes and dates.
  • strict – Method of handling errors. True will raise ValueError. False will ignore rows with errors.
Returns:

A list of dicts of the facts defined in RFC 3659. The key names must be lowercase. The filename is stored under the name key.

Return type:

list

wpull.protocol.ftp.util.reply_code_tuple(code: int) → typing.Tuple[source]

Return the reply code as a tuple.

Parameters:code – The reply code.
Returns:Each item in the tuple is the digit.
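
Example usage:

from wpull.protocol.ftp.util import reply_code_tuple

print(reply_code_tuple(226))  # (2, 2, 6)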

protocol.http Module

HTTP Protocol.

protocol.http.chunked Module

Chunked transfer encoding.

class wpull.protocol.http.chunked.ChunkedTransferReader(connection, read_size=4096)[source]

Bases: object

Read chunked transfer encoded stream.

Parameters:connection (connection.Connection) – Established connection.
read_chunk_body()[source]

Read a fragment of a single chunk.

Call read_chunk_header() first.

Returns:2-item tuple with the content data and raw data. First item is empty bytes string when chunk is fully read.
Return type:tuple

Coroutine.

read_chunk_header()[source]

Read a single chunk’s header.

Returns:2-item tuple with the size of the content in the chunk and the raw header byte string.
Return type:tuple

Coroutine.

read_trailer()[source]

Read the HTTP trailer fields.

Returns:The trailer data.
Return type:bytes

Coroutine.

protocol.http.client Module

Basic HTTP Client.

class wpull.protocol.http.client.Client(*args, stream_factory=<class 'wpull.protocol.http.stream.Stream'>, **kwargs)[source]

Bases: wpull.protocol.abstract.client.BaseClient

Stateless HTTP/1.1 client.

The session object is Session.

session() → wpull.protocol.http.client.Session[source]
class wpull.protocol.http.client.Session(stream_factory: typing.Callable=None, **kwargs)[source]

Bases: wpull.protocol.abstract.client.BaseSession

HTTP request and response session.

class Event[source]

Bases: enum.Enum

Session.abort()[source]
Session.done() → bool[source]

Return whether the session was complete.

A session is complete when it has sent a request, read the response header and the response body.

Session.download(file: typing.Union=None, raw: bool=False, rewind: bool=True, duration_timeout: typing.Union=None)[source]

Read the response content into file.

Parameters:
  • file – A file object or asyncio stream.
  • raw – Whether chunked transfer encoding should be included.
  • rewind – Seek the given file back to its original offset after reading is finished.
  • duration_timeout – Maximum time in seconds of which the entire file must be read.

Be sure to call start() first.

Coroutine.

Session.recycle()[source]
Session.start(request: wpull.protocol.http.request.Request) → wpull.protocol.http.request.Response[source]

Begin an HTTP request.

Parameters:request – Request information.
Returns:A response populated with the HTTP headers.

Once the headers are received, call download().

Coroutine.

class wpull.protocol.http.client.SessionState[source]

Bases: enum.Enum

protocol.http.redirect Module

Redirection tracking.

class wpull.protocol.http.redirect.RedirectTracker(max_redirects=20, codes=(301, 302, 303), repeat_codes=(307, 308))[source]

Bases: object

Keeps track of HTTP document URL redirects.

Parameters:
  • max_redirects (int) – The maximum number of redirects to allow.
  • codes – The HTTP status codes indicating a redirect where the method can change to “GET”.
  • repeat_codes – The HTTP status codes indicating a redirect where the method cannot change and future requests should be repeated.
REDIRECT_CODES = (301, 302, 303)
REPEAT_REDIRECT_CODES = (307, 308)
count()[source]

Return the number of redirects received so far.

exceeded()[source]

Return whether the number of redirects has exceeded the maximum.

is_redirect()[source]

Return whether the response contains a redirect code.

is_repeat()[source]

Return whether the next request should be repeated.

load(response)[source]

Load the response and increment the counter.

Parameters:response (http.request.Response) – The response from a previous request.
next_location(raw=False)[source]

Returns the next location.

Parameters:raw (bool) – If True, the original string contained in the Location field will be returned. Otherwise, the URL will be normalized to a complete URL.
Returns:If str, the location. Otherwise, no next location.
Return type:str, None
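
Example usage (an illustrative sketch; assumes the response fields record supports item assignment):

from wpull.protocol.http.redirect import RedirectTracker
from wpull.protocol.http.request import Response

tracker = RedirectTracker(max_redirects=5)

response = Response(status_code=301, reason='Moved Permanently')
response.fields['Location'] = 'http://example.com/new'

tracker.load(response)
print(tracker.is_redirect())  # True: 301 is a redirect code
print(tracker.is_repeat())  # False: 301 allows the method to change to GET
print(tracker.next_location(raw=True))  # 'http://example.com/new'
print(tracker.exceeded())  # False after a single redirect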

protocol.http.request Module

HTTP conversation objects.

class wpull.protocol.http.request.RawRequest(method=None, resource_path=None, version='HTTP/1.1')[source]

Bases: wpull.protocol.abstract.request.BaseRequest, wpull.protocol.abstract.request.SerializableMixin, wpull.protocol.abstract.request.DictableMixin

Represents an HTTP request.

method

str

The HTTP method in the status line. For example, GET, POST.

resource_path

str

The URL or “path” in the status line.

version

str

The HTTP version in the status line. For example, HTTP/1.0.

fields

namevalue.NameValueRecord

The fields in the HTTP header.

body

body.Body, file-like, None

An optional payload.

encoding

str

The encoding of the status line.

copy()[source]

Return a copy.

parse(data)[source]
parse_status_line(data)[source]

Parse the status line bytes.

Returns:A tuple representing the method, URI, and version.
Return type:tuple
set_continue(offset)[source]

Modify the request into a range request.

to_bytes()[source]
to_dict()[source]
class wpull.protocol.http.request.Request(url=None, method='GET', version='HTTP/1.1')[source]

Bases: wpull.protocol.http.request.RawRequest

Represents a higher level of HTTP request.

address

tuple

An address tuple suitable for socket.connect().

username

str

Username for HTTP authentication.

password

str

Password for HTTP authentication.

parse(data)[source]
prepare_for_send(full_url=False)[source]

Modify the request to be suitable for HTTP server.

Parameters:full_url (bool) – Use full URL as the URI. By default, only the path of the URL is given to the server.
to_dict()[source]
class wpull.protocol.http.request.Response(status_code=None, reason=None, version='HTTP/1.1', request=None)[source]

Bases: wpull.protocol.abstract.request.BaseResponse, wpull.protocol.abstract.request.SerializableMixin, wpull.protocol.abstract.request.DictableMixin

Represents the HTTP response.

status_code

int

The status code in the status line.

status_reason

str

The status reason string in the status line.

version

str

The HTTP version in the status line. For example, HTTP/1.1.

fields

namevalue.NameValueRecord

The fields in the HTTP headers (and trailer, if present).

body

body.Body, file-like, None

The optional payload (without any transfer or content encoding).

request

The corresponding request.

encoding

str

The encoding of the status line.

parse(data)[source]
classmethod parse_status_line(data)[source]

Parse the status line bytes.

Returns:A tuple representing the version, code, and reason.
Return type:tuple
protocol
response_code()[source]
response_message()[source]
to_bytes()[source]
to_dict()[source]

protocol.http.robots Module

Robots.txt file logistics.

exception wpull.protocol.http.robots.NotInPoolError[source]

Bases: Exception

The URL is not in the pool.

class wpull.protocol.http.robots.RobotsTxtChecker(web_client: wpull.protocol.http.web.WebClient=None, robots_txt_pool: wpull.robotstxt.RobotsTxtPool=None)[source]

Bases: object

Robots.txt file fetcher and checker.

Parameters:
  • web_client – Web Client.
  • robots_txt_pool – Robots.txt Pool.
can_fetch(request: wpull.protocol.http.request.Request, file=None) → bool[source]

Return whether the request can be fetched.

Parameters:
  • request – Request.
  • file – A file object to where the robots.txt contents are written.

Coroutine.

can_fetch_pool(request: wpull.protocol.http.request.Request)[source]

Return whether the request can be fetched based on the pool.

fetch_robots_txt(request: wpull.protocol.http.request.Request, file=None)[source]

Fetch the robots.txt file for the request.

Coroutine.

robots_txt_pool

Return the RobotsTxtPool.

web_client

Return the WebClient.

protocol.http.stream Module

HTTP protocol streamers.

wpull.protocol.http.stream.DEFAULT_NO_CONTENT_CODES = frozenset(range(100, 200)) | {204, 304}

Status codes where a response body is prohibited.

class wpull.protocol.http.stream.Stream(connection, keep_alive=True, ignore_length=False)[source]

Bases: object

HTTP stream reader/writer.

Parameters:
  • connection (connection.Connection) – An established connection.
  • keep_alive (bool) – If True, use HTTP keep-alive.
  • ignore_length (bool) – If True, Content-Length headers will be ignored. When using this option, keep_alive should be False.
connection

The underlying connection.

close()[source]

Close the connection.

closed()[source]

Return whether the connection is closed.

connection
data_event_dispatcher
classmethod get_read_strategy(response)[source]

Return the appropriate strategy for reading the response.

Returns:chunked, length, close.
Return type:str
read_body(request, response, file=None, raw=False)[source]

Read the response’s content body.

Coroutine.

read_response(response=None)[source]

Read the response’s HTTP status line and header fields.

Coroutine.

reconnect()[source]

Connect the connection if needed.

Coroutine.

write_body(file, length=None)[source]

Send the request’s content body.

Coroutine.

write_request(request, full_url=False)[source]

Send the request’s HTTP status line and header fields.

This class will automatically connect the connection if the connection is closed.

Coroutine.

wpull.protocol.http.stream.is_no_body(request, response, no_content_codes=DEFAULT_NO_CONTENT_CODES)[source]

Return whether a content body is not expected.

protocol.http.util Module

Miscellaneous HTTP functions.

wpull.protocol.http.util.parse_charset(header_string)[source]

Parse a “Content-Type” string for the document encoding.

Returns:str, None
wpull.protocol.http.util.should_close(http_version, connection_field)[source]

Return whether the connection should be closed.

Parameters:
  • http_version (str) – The HTTP version string like HTTP/1.0.
  • connection_field (str) – The value for the Connection header.
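
Example usage (an illustrative sketch; the exact form of the returned charset name may vary):

from wpull.protocol.http.util import parse_charset, should_close

print(parse_charset('text/html; charset=UTF-8'))  # the document encoding
print(should_close('HTTP/1.0', None))  # True: HTTP/1.0 defaults to closing
print(should_close('HTTP/1.1', 'close'))  # True: explicit close requested
print(should_close('HTTP/1.1', None))  # False: HTTP/1.1 defaults to keep-alive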

protocol.http.web Module

Advanced HTTP Client handling.

class wpull.protocol.http.web.LoopType[source]

Bases: enum.Enum

Indicates the type of request and response.

authentication = None

Response to a HTTP authentication.

normal = None

Normal response.

redirect = None

Redirect.

robots = None

Response to a robots.txt request.

class wpull.protocol.http.web.WebClient(http_client: typing.Union=None, request_factory: typing.Callable=<class 'wpull.protocol.http.request.Request'>, redirect_tracker_factory: typing.Union=<class 'wpull.protocol.http.redirect.RedirectTracker'>, cookie_jar: typing.Union=None)[source]

Bases: object

A web client handles redirects, cookies, basic authentication.

Parameters:
  • http_client – An HTTP client. If not given, a default Client is created.
  • request_factory – A callable that returns a new Request.
  • redirect_tracker_factory – A callable that returns a new RedirectTracker.
  • cookie_jar – A CookieJar to enable cookie support.
close()[source]
cookie_jar

Return the Cookie Jar.

http_client

Return the HTTP Client.

redirect_tracker_factory

Return the Redirect Tracker factory.

request_factory

Return the Request factory.

session(request: wpull.protocol.http.request.Request) → wpull.protocol.http.web.WebSession[source]

Return a fetch session.

Parameters:request – The request to be fetched.

Example usage:

client = WebClient()
session = client.session(Request('http://www.example.com'))

with session:
    while not session.done():
        request = session.next_request()
        print(request)

        response = yield from session.start()
        print(response)

        if session.done():
            with open('myfile.html', 'wb') as file:
                yield from session.download(file)
        else:
            yield from session.download()
Returns:WebSession
class wpull.protocol.http.web.WebSession(request: wpull.protocol.http.request.Request, http_client: wpull.protocol.http.client.Client, redirect_tracker: wpull.protocol.http.redirect.RedirectTracker, request_factory: typing.Callable, cookie_jar: typing.Union=None)[source]

Bases: object

A web session.

done() → bool[source]

Return whether the session has finished.

Returns:If True, the document has been fully fetched.
Return type:bool
download(file: typing.Union=None, duration_timeout: typing.Union=None)[source]

Download content.

Parameters:
  • file – An optional file object for the document contents.
  • duration_timeout – Maximum time in seconds of which the entire file must be read.
Returns:

An instance of http.request.Response.

Return type:

Response

See WebClient.session() for proper usage of this function.

Coroutine.

loop_type() → wpull.protocol.http.web.LoopType[source]

Return the type of response.

Seealso:LoopType.
next_request() → typing.Union[source]

Return the next Request to be fetched.

redirect_tracker

Return the Redirect Tracker.

start()[source]

Begin fetching the next request.

proxy Module

proxy.client Module

Proxy support for HTTP requests.

class wpull.proxy.client.HTTPProxyConnectionPool(proxy_address, *args, proxy_ssl=False, authentication=None, ssl_context=True, host_filter=None, **kwargs)[source]

Bases: wpull.network.pool.ConnectionPool

Establish pooled connections to an HTTP proxy.

Parameters:
  • proxy_address (tuple) – Tuple containing host and port of the proxy server.
  • connection_pool (connection.ConnectionPool) – Connection pool
  • proxy_ssl (bool) – Whether to connect to the proxy using HTTPS.
  • authentication (tuple) – Tuple containing username and password.
  • ssl_context – SSL context for SSL connections on TCP tunnels.
  • host_filter (proxy.hostfilter.HostFilter) – Host filter for deciding whether a connection is routed through the proxy. A test result that returns True is routed through the proxy.
acquire(host, port, use_ssl=False, host_key=None)[source]
acquire_proxy(host, port, use_ssl=False, host_key=None, tunnel=True)[source]

Check out a connection.

This function is the same as acquire but with extra arguments concerning proxies.

Coroutine.

add_auth_header(request)[source]

Add the username and password to the HTTP request.

no_wait_release(proxy_connection)[source]
release(proxy_connection)[source]

proxy.hostfilter Module

Host filtering.

class wpull.proxy.hostfilter.HostFilter(accept_domains=None, reject_domains=None, accept_hostnames=None, reject_hostnames=None)[source]

Bases: object

Accept or reject hostnames.

classmethod suffix_match(domain_list, target_domain)[source]
test(host)[source]
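
Example usage (an illustrative sketch; expected results follow the suffix-matching behavior of suffix_match()):

from wpull.proxy.hostfilter import HostFilter

host_filter = HostFilter(accept_domains=['example.com'],
                         reject_hostnames=['static.example.com'])

print(host_filter.test('www.example.com'))  # expected True: domain suffix match
print(host_filter.test('static.example.com'))  # expected False: rejected hostname
print(host_filter.test('example.org'))  # expected False: not an accepted domain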

proxy.server Module

Proxy Tools

class wpull.proxy.server.HTTPProxyServer(http_client: wpull.protocol.http.client.Client)[source]

Bases: wpull.application.hook.HookableMixin

HTTP proxy server for use with man-in-the-middle recording.

This class instance is meant to be used as a callback:

asyncio.start_server(HTTPProxyServer(HTTPClient))
Parameters:http_client (http.client.Client) – The HTTP client.
request_callback

A callback function that accepts a Request.

pre_response_callback

A callback function that accepts a Request and Response

response_callback

A callback function that accepts a Request and Response

class Event[source]

Bases: enum.Enum

class wpull.proxy.server.HTTPProxySession(http_client: wpull.protocol.http.client.Client, reader: asyncio.streams.StreamReader, writer: asyncio.streams.StreamWriter)[source]

Bases: wpull.application.hook.HookableMixin

class Event[source]

Bases: enum.Enum

regexstream Module

Regular expression streams.

class wpull.regexstream.RegexStream(file, pattern, read_size=16384, overlap_size=4096)[source]

Bases: object

Streams file with regular expressions.

Parameters:
  • file – File object.
  • pattern – A compiled regular expression object.
  • read_size (int) – The size of a chunk of text that is searched.
  • overlap_size (int) – The amount of overlap between chunks of text that is searched.
stream()[source]

Iterate the file stream.

Returns:Each item is a tuple:
  1. None, regex match
  2. str
Return type:iterator
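
Example usage:

import io
import re

from wpull.regexstream import RegexStream

file = io.StringIO('see http://example.com/a then http://example.com/b')
stream = RegexStream(file, re.compile(r'https?://\S+'))

for match, text in stream.stream():
    # Non-matching text between matches is yielded with match set to None.
    if match:
        print(text)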

resmon Module

Resource monitor.

wpull.resmon.ResourceInfo

Resource level information.

wpull.resmon.path

str, None

File path of the resource. None is provided for memory usage.

wpull.resmon.free

int

Number of bytes available.

wpull.resmon.limit

int

Minimum bytes of the resource.

alias of ResourceInfoType

class wpull.resmon.ResourceMonitor(resource_paths=('/', ), min_disk=10000, min_memory=10000)[source]

Bases: object

Monitor available resources such as disk space and memory.

Parameters:
  • resource_paths (list) – List of paths to monitor. Recommended paths include temporary directories and the current working directory.
  • min_disk (int, optional) – Minimum disk space in bytes.
  • min_memory (int, optional) – Minimum memory in bytes.
check()[source]

Check resource levels.

Returns:If None is returned, no levels are exceeded. Otherwise, the first ResourceInfo exceeding limits is returned.
Return type:None, ResourceInfo
get_info()[source]

Return ResourceInfo instances.
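
Example usage (requires psutil; an illustrative sketch):

from wpull.resmon import ResourceMonitor

# Check for at least 100 MiB of free disk space on '/' and 100 MiB of
# free memory.
monitor = ResourceMonitor(resource_paths=['/'],
                          min_disk=100 * 1024 ** 2,
                          min_memory=100 * 1024 ** 2)

info = monitor.check()

if info is None:
    print('resource levels OK')
else:
    print('low resource:', info.path, info.free, info.limit)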

robotstxt Module

Robots.txt exclusion directives.

class wpull.robotstxt.RobotsTxtPool[source]

Bases: object

Pool of robots.txt parsers.

can_fetch(url_info: wpull.url.URLInfo, user_agent: str)[source]

Return whether the URL can be fetched.

has_parser(url_info: wpull.url.URLInfo)[source]

Return whether a parser has been created for the URL.

load_robots_txt(url_info: wpull.url.URLInfo, text: str)[source]

Load the robots.txt file.

classmethod url_info_key(url_info: wpull.url.URLInfo) → tuple[source]
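
Example usage (an illustrative sketch with a hand-written robots.txt):

from wpull.robotstxt import RobotsTxtPool
from wpull.url import URLInfo

pool = RobotsTxtPool()
pool.load_robots_txt(URLInfo.parse('http://example.com/'),
                     'User-agent: *\nDisallow: /private/\n')

print(pool.has_parser(URLInfo.parse('http://example.com/page')))  # True
print(pool.can_fetch(URLInfo.parse('http://example.com/private/x'),
                     'ExampleBot'))  # expected False
print(pool.can_fetch(URLInfo.parse('http://example.com/page'),
                     'ExampleBot'))  # expected True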

scraper Module

Document scrapers.

scraper.base Module

Base classes

class wpull.scraper.base.BaseExtractiveScraper[source]

Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseExtractiveReader

Return the links.

Returns:Each item is a str which represents a link.
Return type:iterator
class wpull.scraper.base.BaseHTMLScraper[source]

Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseHTMLReader

class wpull.scraper.base.BaseScraper[source]

Bases: object

Base class for scrapers.

scrape(request, response, link_type=None)[source]

Extract the URLs from the document.

Parameters:
Returns:

LinkContexts and document information.

If None, then the scraper does not support scraping the document.

Return type:

ScrapeResult, None

class wpull.scraper.base.BaseTextStreamScraper[source]

Bases: wpull.scraper.base.BaseScraper, wpull.document.base.BaseTextStreamReader

Base class for scrapers that process both link and non-link text.

Return the links.

This function is a convenience function for calling iter_processed_text() and returning only the links.

iter_processed_text(file, encoding=None, base_url=None)[source]

Return the file text and processed absolute links.

Parameters:
  • file – A file object containing the document.
  • encoding (str) – The encoding of the document.
  • base_url (str) – The URL at which the document is located.
Returns:

Each item is a tuple:

  1. str: The text
  2. bool: Whether the text is a link

Return type:

iterator

Convenience function for scraping from a text string.

class wpull.scraper.base.DemuxDocumentScraper(document_scrapers)[source]

Bases: wpull.scraper.base.BaseScraper

Puts multiple Document Scrapers into one.

scrape(request, response, link_type=None)[source]

Iterate the scrapers, returning the first of the results.

scrape_info(request, response, link_type=None)[source]

Iterate the scrapers and return a dict of results.

Returns:A dict where the keys are the scrapers instances and the values are the results. That is, a mapping from BaseDocumentScraper to ScrapeResult.
Return type:dict
wpull.scraper.base.LinkContext

A named tuple describing a scraped link.

wpull.scraper.base.link

str

The link that was scraped.

wpull.scraper.base.inline

bool

Whether the link is an embedded object.

wpull.scraper.base.linked

bool

Whether the link links to another page.

wpull.scraper.base.link_type

A value from item.LinkType.

wpull.scraper.base.extra

Any extra info.

alias of LinkContextType

class wpull.scraper.base.ScrapeResult(link_contexts, encoding)[source]

Bases: dict

Links scraped from a document.

This class is subclassed from dict and contains convenience methods.

encoding

Character encoding of the document.

inline

Link Contexts of objects embedded in the document.

inline_links

URLs of objects embedded in the document.

link_contexts

All Link Contexts.

linked

Link Contexts of objects linked from the document.

linked_links

URLs of objects linked from the document.

scraper.css Module

Stylesheet scraper.

class wpull.scraper.css.CSSScraper(encoding_override=None)[source]

Bases: wpull.document.css.CSSReader, wpull.scraper.base.BaseTextStreamScraper

Scrapes CSS stylesheet documents.

iter_processed_text(file, encoding=None, base_url=None)[source]
scrape(request, response, link_type=None)[source]

scraper.html Module

HTML link extractor.

class wpull.scraper.html.ElementWalker(css_scraper=None, javascript_scraper=None)[source]

Bases: object

ATTR_HTML = 2

Flag for links that point to other documents.

ATTR_INLINE = 1

Flag for embedded objects (like images, stylesheets) in documents.

DYNAMIC_ATTRIBUTES = ('onkey', 'oncli', 'onmou')

Attributes that contain JavaScript.

HTML element attributes that may contain links.

Iterate elements looking for links.

Parameters:
  • css_scraper (scraper.css.CSSScraper) – Optional CSS scraper.
  • javascript_scraper (scraper.javascript.JavaScriptScraper) – Optional JavaScript scraper.
OPEN_GRAPH_MEDIA_NAMES = ('og:image', 'og:audio', 'og:video', 'twitter:image:src', 'twitter:image0', 'twitter:image1', 'twitter:image2', 'twitter:image3', 'twitter:player:stream')
TAG_ATTRIBUTES = {'bgsound': {'src': 1}, 'body': {'background': 1}, 'input': {'src': 1}, 'area': {'href': 2}, 'iframe': {'src': 3}, 'applet': {'code': 1}, 'script': {'src': 1}, 'embed': {'href': 2, 'src': 3}, 'overlay': {'src': 3}, 'a': {'href': 2}, 'object': {'data': 1}, 'form': {'action': 2}, 'table': {'background': 1}, 'th': {'background': 1}, 'td': {'background': 1}, 'layer': {'src': 3}, 'fig': {'src': 1}, 'frame': {'src': 3}, 'img': {'href': 1, 'lowsrc': 1, 'src': 1}}

Mapping of element tag names to attributes containing links.

Return whether the link is likely to be an external object.

Return whether the link is likely to be an inline object.

Iterate the document root for links.

Returns:An iterator of LinkInfo.
Return type:iterable

Iterate an element by looking at its attributes for links.

Iterate links of a JavaScript pseudo-link attribute.

Iterate an HTML element.

Get the element text as a link.

Iterate a link for URLs.

This function handles stylesheets and icons in addition to standard scraping rules.

Iterate the meta element for links.

This function handles refresh URLs.

Iterate object and embed elements.

This function also looks at codebase and archive attributes.

Iterate a param element.

Iterate any element for links using generic rules.

Iterate a script element.

Iterate a style element.

classmethod robots_cannot_follow(element)[source]

Return whether we cannot follow links due to robots.txt directives.

class wpull.scraper.html.HTMLScraper(html_parser, element_walker, followed_tags=None, ignored_tags=None, robots=False, only_relative=False, encoding_override=None)[source]

Bases: wpull.document.html.HTMLReader, wpull.scraper.base.BaseHTMLScraper

Scraper for HTML documents.

Parameters:
  • html_parser (document.htmlparse.base.BaseParser) – An HTML parser such as the lxml or html5lib one.
  • element_walker (ElementWalker) – HTML element walker.
  • followed_tags – A list of tags that should be scraped
  • ignored_tags – A list of tags that should not be scraped
  • robots – If True, discard any links if they cannot be followed
  • only_relative – If True, discard any links that are not absolute paths
scrape(request, response, link_type=None)[source]
scrape_file(file, encoding=None, base_url=None)[source]

Scrape a file for links.

See scrape() for the return value.

wpull.scraper.html.LinkInfo

Information about a link in a lxml document.

wpull.scraper.html.element

An instance of document.HTMLReadElement.

wpull.scraper.html.tag

str

The element tag name.

wpull.scraper.html.attrib

str, None

If str, the name of the attribute. Otherwise, the link was found in element.text.

wpull.scraper.html.link

str

The link found.

wpull.scraper.html.inline

bool

Whether the link is an embedded object (like images or stylesheets).

wpull.scraper.html.linked

bool

Whether the link is a link to another page.

wpull.scraper.html.base_link

str, None

The base URL.

wpull.scraper.html.value_type

str

Indicates how the link was found. Possible values are

  • plain: The link was found plainly in an attribute value.
  • list: The link was found in a space separated list.
  • css: The link was found in a CSS text.
  • refresh: The link was found in a refresh meta string.
  • script: The link was found in JavaScript text.
  • srcset: The link was found in a srcset attribute.

wpull.scraper.html.link_type

A value from item.LinkType.

alias of LinkInfoType

scraper.javascript Module

Javascript scraper.

class wpull.scraper.javascript.JavaScriptScraper(encoding_override=None)[source]

Bases: wpull.document.javascript.JavaScriptReader, wpull.scraper.base.BaseTextStreamScraper

Scrapes JavaScript documents.

iter_processed_text(file, encoding=None, base_url=None)[source]
scrape(request, response, link_type=None)[source]

scraper.sitemap Module

Sitemap scraper

class wpull.scraper.sitemap.SitemapScraper(html_parser, encoding_override=None)[source]

Bases: wpull.document.sitemap.SitemapReader, wpull.scraper.base.BaseExtractiveScraper

Scrape Sitemaps

scrape(request, response, link_type=None)[source]

scraper.util Module

Misc functions.

wpull.scraper.util.clean_link_soup(link)[source]

Strip whitespace from a link in HTML soup.

Parameters:link (str) – A string containing the link with lots of whitespace.

The link is split into lines. For each line, leading and trailing whitespace is removed and tabs are removed throughout. The lines are concatenated and returned.

For example, passing the href value of:

<a href=" http://example.com/

        blog/entry/

    how smaug stole all the bitcoins.html
">

will return http://example.com/blog/entry/how smaug stole all the bitcoins.html.

Returns:The cleaned link.
Return type:str

wpull.scraper.util.identify_link_type(filename)[source]

Return the link type guessed by the filename extension.

Returns:A value from item.LinkType.
Return type:str
wpull.scraper.util.is_likely_inline(link)[source]

Return whether the link is likely to be inline.

wpull.scraper.util.is_likely_link(text)[source]

Return whether the text is likely to be a link.

This function assumes that leading/trailing whitespace has already been removed.

Returns:bool

wpull.scraper.util.is_unlikely_link(text)[source]

Return whether the text is likely to cause false positives.

This function assumes that leading/trailing whitespace has already been removed.

Returns:bool
wpull.scraper.util.parse_refresh(text)[source]

Parse text for an HTTP Refresh URL.

Returns:str, None
wpull.scraper.util.urljoin_safe(base_url, url, allow_fragments=True)[source]

urljoin with warning log on error.

Returns:str, None
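
Example usage:

from wpull.scraper.util import parse_refresh, urljoin_safe

print(parse_refresh('5; url=http://example.com/next.html'))
# http://example.com/next.html
print(urljoin_safe('http://example.com/a/b.html', 'c.html'))
# http://example.com/a/c.html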

stats Module

Statistics.

class wpull.stats.Statistics(url_table: typing.Union=None)[source]

Bases: object

Statistics.

start_time

float

Timestamp when the engine started.

stop_time

float

Timestamp when the engine stopped.

files

int

Number of files downloaded.

size

int

Size of files in bytes.

errors

A Counter mapping error types to integers.

quota

int

Threshold in bytes at which the download quota is exceeded.

bandwidth_meter

network.BandwidthMeter

The bandwidth meter.

duration

Return the duration of the interval in seconds.

increment(size: int)[source]

Increment the number of files downloaded.

Parameters:size – The size of the file
increment_error(error: Exception)[source]

Increment the error counter preferring base exceptions.

is_quota_exceeded

Return whether the quota is exceeded.

start()[source]

Record the start time.

stop()[source]

Record the stop time.
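
Example usage:

from wpull.stats import Statistics

stats = Statistics()
stats.start()
stats.increment(4096)  # one file of 4096 bytes
stats.increment(1024)
stats.increment_error(OSError('disk error'))
stats.stop()

print(stats.files)  # 2
print(stats.size)  # 5120
print(stats.duration)  # seconds elapsed between start() and stop()
print(dict(stats.errors))  # counts per error type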

string Module

String and binary data functions.

wpull.string.coerce_str_to_ascii(string)[source]

Force the contents of the string to be ASCII.

Anything not ASCII will be replaced with a replacement character.

Deprecated since version 0.1002: Use printable_str() instead.

wpull.string.detect_encoding(data, encoding=None, fallback='latin1', is_html=False)[source]

Detect the character encoding of the data.

Returns:

The name of the codec

Return type:

str

Raises:
  • ValueError – The codec could not be detected. This error can only occur if fallback is not a “lossless” codec.
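
Example usage (an illustrative sketch; the detected codec name depends on the sniffing heuristics):

from wpull.string import detect_encoding, normalize_codec_name

print(detect_encoding('héllo wörld'.encode('utf-8')))  # e.g. utf-8
print(detect_encoding(b'plain ascii text'))  # falls back to a lossless codec
print(normalize_codec_name('UTF8'))  # e.g. utf-8
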
wpull.string.format_size(num, format_str='{num:.1f} {unit}')[source]

Format the file size into a human readable text.

http://stackoverflow.com/a/1094933/1524507

wpull.string.normalize_codec_name(name)[source]

Return the Python name of the encoder/decoder.

Returns:str, None
wpull.string.printable_bytes(data)[source]

Remove any bytes that are not printable ASCII.

This function is intended for sniffing content types such as UTF-16 encoded text.

wpull.string.printable_str(text, keep_newlines=False)[source]

Escape any control or non-ASCII characters from string.

This function is intended for use with strings from an untrusted source such as writing to a console or writing to logs. It is designed to prevent things like ANSI escape sequences from showing.

Use repr() or ascii() instead for things such as Exception messages.

wpull.string.to_bytes(instance, encoding='utf-8', error='strict')[source]

Convert an instance recursively to bytes.

wpull.string.to_str(instance, encoding='utf-8')[source]

Convert an instance recursively to string.

wpull.string.try_decoding(data, encoding)[source]

Return whether the Python codec could decode the data.

url Module

URL parsing based on WHATWG URL living standard.

wpull.url.C0_CONTROL_SET = frozenset(chr(i) for i in range(0x00, 0x20))

Characters from 0x00 to 0x1f inclusive

wpull.url.DEFAULT_ENCODE_SET = frozenset({32, 96, 34, 35, 60, 62, 63})

Percent encoding set as defined by WHATWG URL living standard.

Does not include U+0000 to U+001F nor U+007F and above.

wpull.url.FORBIDDEN_HOSTNAME_CHARS = frozenset({'\\', ':', '#', ' ', '%', ']', '?', '@', '/', '['})

Forbidden hostname characters.

Does not include non-printing characters. Meant for ASCII.

wpull.url.FRAGMENT_ENCODE_SET = frozenset({32, 96, 34, 60, 62})

Encoding set for fragment.

wpull.url.PASSWORD_ENCODE_SET = frozenset({32, 96, 34, 35, 64, 47, 60, 92, 62, 63})

Encoding set for passwords.

class wpull.url.PercentEncoderMap(encode_set)[source]

Bases: collections.defaultdict

Helper map for percent encoding.

wpull.url.QUERY_ENCODE_SET = frozenset({96, 34, 35, 60, 62})

Encoding set for query strings.

This set does not include U+0020 (space) so it can be replaced with U+002B (plus sign) later.

wpull.url.QUERY_VALUE_ENCODE_SET = frozenset({96, 34, 35, 37, 38, 43, 60, 62})

Encoding set for a query value.

class wpull.url.URLInfo[source]

Bases: object

Represent parts of a URL.

raw

str

Original string.

scheme

str

Protocol (for example, HTTP, FTP).

authority

str

Raw userinfo and host.

path

str

Location of resource. This value always begins with a slash (/).

query

str

Additional request parameters.

fragment

str

Named anchor of a document.

userinfo

str

Raw username and password.

username

str

Username.

password

str

Password.

host

str

Raw hostname and port.

hostname

str

Hostname or IP address.

port

int

IP address port number.

resource

str

Raw path, query, and fragment. This value always begins with a slash (/).

query_map

dict

Mapping of the query. Values are lists.

url

str

A normalized URL without userinfo and fragment.

encoding

str

Codec name for IRI support.

If scheme is not something like HTTP or FTP, the remaining attributes are None.

All attributes are read only.

For more information about how the URL parts are derived, see https://medialize.github.io/URI.js/about-uris.html

authority
encoding
fragment
host
hostname
hostname_with_port

Return the host portion but omit default port if needed.

is_ipv6()[source]

Return whether the URL is IPv6.

is_port_default()[source]

Return whether the URL is using the default port.

classmethod parse(url, default_scheme='http', encoding='utf-8')[source]

Parse a URL and return a URLInfo.

classmethod parse_authority(authority)[source]

Parse the authority part and return userinfo and host.

classmethod parse_host(host)[source]

Parse the host and return hostname and port.

classmethod parse_hostname(hostname)[source]

Parse the hostname and normalize.

classmethod parse_ipv6_hostname(hostname)[source]

Parse and normalize an IPv6 address.

classmethod parse_userinfo(userinfo)[source]

Parse the userinfo and return username and password.

password
path
port
query
query_map
raw
resource
scheme
split_path()[source]

Return the directory and filename from the path.

The results are not percent-decoded.

to_dict()[source]

Return a dict of the attributes.

url
userinfo
username
wpull.url.USERNAME_ENCODE_SET = frozenset({32, 96, 34, 35, 64, 47, 58, 60, 92, 62, 63})

Encoding set for usernames.

wpull.url.flatten_path(path, flatten_slashes=False)[source]

Flatten an absolute URL path by removing the dot segments.

urllib.parse.urljoin() has some support for removing dot segments, but it is conservative and only removes them as needed.

Parameters:
  • path (str) – The URL path.
  • flatten_slashes (bool) – If True, consecutive slashes are removed.

The path returned will always have a leading slash.
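
For example (standard dot-segment removal is assumed):

>>> from wpull.url import flatten_path
>>> flatten_path('/a/b/../c/./d')
'/a/c/d'
>>> flatten_path('//a///b', flatten_slashes=True)
'/a/b'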

wpull.url.is_subdir(base_path, test_path, trailing_slash=False, wildcards=False)[source]

Return whether a path is a subpath of another.

Parameters:
  • base_path – The base path
  • test_path – The path which we are testing
  • trailing_slash – If True, the trailing slash is treated with importance. For example, /images/ is a directory while /images is a file.
  • wildcards – If True, globbing wildcards are matched against paths
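
A sketch of the basic case (outputs illustrative):

>>> from wpull.url import is_subdir
>>> is_subdir('/blog/', '/blog/2016/hello')
True
>>> is_subdir('/blog/', '/about')
False
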
wpull.url.normalize(url, **kwargs)[source]

Normalize a URL.

This function is a convenience function that is equivalent to:

>>> URLInfo.parse('http://example.com').url
'http://example.com'
See also: URLInfo.parse().
wpull.url.normalize_fragment(text, encoding='utf-8')[source]

Normalize a fragment.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_hostname(hostname)[source]

Normalizes a hostname so that it is ASCII and a valid domain name.

wpull.url.normalize_ipv4_address(address)[source]
wpull.url.normalize_password(text, encoding='utf-8')[source]

Normalize a password.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_path(path, encoding='utf-8')[source]

Normalize a path string.

Flattens a path by removing dot parts, percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_query(text, encoding='utf-8')[source]

Normalize a query string.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.normalize_username(text, encoding='utf-8')[source]

Normalize a username.

Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.

wpull.url.parse_ipv4_int(text)[source]
wpull.url.parse_url_or_log(url, encoding='utf-8')[source]

Parse and return a URLInfo.

This function logs a warning if the URL cannot be parsed and returns None.

wpull.url.percent_encode(text, encode_set=frozenset({32, 96, 34, 35, 60, 62, 63}), encoding='utf-8')[source]

Percent encode text.

Unlike Python’s quote, this function accepts a blacklist instead of a whitelist of safe characters.
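
For example, with the default encode set (the output assumes uppercase percent-encoding, as used by the normalize functions below):

>>> from wpull.url import percent_encode
>>> percent_encode('a b<c>')
'a%20b%3Cc%3E'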

wpull.url.percent_encode_plus(text, encode_set=frozenset({96, 34, 35, 60, 62}), encoding='utf-8')[source]

Percent encode text for query strings.

Unlike Python’s quote_plus, this function accepts a blacklist instead of a whitelist of safe characters.

wpull.url.percent_encode_query_value(text, encoding='utf-8')[source]

Percent encode a query value.

wpull.url.query_to_map(text)[source]

Return a key-values mapping from a query string.

Plus symbols are replaced with spaces.
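
For example (values are lists, matching URLInfo.query_map):

>>> from wpull.url import query_to_map
>>> result = query_to_map('a=1&a=2&b=hello+world')
>>> result['a']
['1', '2']
>>> result['b']
['hello world']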

wpull.url.schemes_similar(scheme1, scheme2)[source]

Return whether URL schemes are similar.

This function considers the following schemes to be similar:

  • HTTP and HTTPS
wpull.url.split_query(qs, keep_blank_values=False)[source]

Split the query string.

Note for empty values: If an equal sign (=) is present, the value will be an empty string (''). Otherwise, the value will be None:

>>> list(split_query('a=&b', keep_blank_values=True))
[('a', ''), ('b', None)]

No processing is done on the actual values.

wpull.url.uppercase_percent_encoding(text)[source]

Uppercases percent-encoded sequences.

wpull.url.urljoin(base_url, url, allow_fragments=True)[source]

Join URLs like urllib.parse.urljoin but allow scheme-relative URL.
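
For example (outputs illustrative):

>>> from wpull.url import urljoin
>>> urljoin('https://example.com/a/', 'b.html')
'https://example.com/a/b.html'
>>> urljoin('https://example.com/', '//cdn.example.net/lib.js')
'https://cdn.example.net/lib.js'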

urlfilter Module

URL filters.

class wpull.urlfilter.BackwardDomainFilter(accepted=None, rejected=None)[source]

Bases: wpull.urlfilter.BaseURLFilter

Return whether the hostname matches a list of hostname suffixes.

classmethod match(domain_list, test_domain)[source]
test(url_info, url_table_record)[source]
class wpull.urlfilter.BackwardFilenameFilter(accepted=None, rejected=None)[source]

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that match the filename suffixes.

classmethod match(suffix_list, test_filename)[source]
test(url_info, url_table_record)[source]
class wpull.urlfilter.BaseURLFilter[source]

Bases: object

Base class for URL filters.

The Processor uses filters to determine whether a URL should be downloaded.

test(url_info: wpull.url.URLInfo, url_record: wpull.pipeline.item.URLRecord) → bool[source]

Return whether the URL should be downloaded.

Parameters:
  • url_info – URL to be tested.
  • url_record – Fetch metadata about the URL.
Returns: If True, the filter passed and the URL should be downloaded.
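
As a sketch of this interface, a hypothetical filter that rejects URLs carrying a query string could be written like this (the class and its logic are illustrative, not part of Wpull):

from wpull.urlfilter import BaseURLFilter

class NoQueryFilter(BaseURLFilter):
    """Hypothetical filter: reject any URL that has a query string."""

    def test(self, url_info, url_record):
        # url_info.query is the raw query string; an empty value passes.
        return not url_info.query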

class wpull.urlfilter.DemuxURLFilter(url_filters: typing.Iterator)[source]

Bases: wpull.urlfilter.BaseURLFilter

Combines multiple URL filters into one.

test(url_info, url_table_record)[source]
test_info(url_info, url_table_record) → dict[source]

Returns info about which filters passed or failed.

Returns: A dict containing the keys:
  • verdict (bool): Whether all the tests passed.
  • passed (set): A set of URLFilters that passed.
  • failed (set): A set of URLFilters that failed.
  • map (dict): A mapping from URLFilter class name (str) to the verdict (bool).
Return type: dict
url_filters
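
A usage sketch (it is assumed here, for illustration, that HTTPSOnlyFilter ignores the url_record argument, so None stands in for it):

from wpull.url import URLInfo
from wpull.urlfilter import DemuxURLFilter, HTTPSOnlyFilter

demux = DemuxURLFilter([HTTPSOnlyFilter()])
info = demux.test_info(URLInfo.parse('http://example.com/'), None)
print(info['verdict'])  # False: the URL is plain HTTP
print(info['map'])      # e.g. {'HTTPSOnlyFilter': False}
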
class wpull.urlfilter.DirectoryFilter(accepted=None, rejected=None)[source]

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that match a directory path part.

test(url_info, url_table_record)[source]
class wpull.urlfilter.FollowFTPFilter(follow=False)[source]

Bases: wpull.urlfilter.BaseURLFilter

Follow links to FTP URLs.

test(url_info, url_table_record)[source]
class wpull.urlfilter.HTTPSOnlyFilter[source]

Bases: wpull.urlfilter.BaseURLFilter

Allow a URL only if it is HTTPS.

test(url_info, url_table_record)[source]
class wpull.urlfilter.HostnameFilter(accepted=None, rejected=None)[source]

Bases: wpull.urlfilter.BaseURLFilter

Return whether the hostname matches exactly in a list.

test(url_info, url_table_record)[source]
class wpull.urlfilter.LevelFilter(max_depth, inline_max_depth=5)[source]

Bases: wpull.urlfilter.BaseURLFilter

Allow URLs up to a level of recursion.

test(url_info, url_table_record)[source]
class wpull.urlfilter.ParentFilter[source]

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that ascend to parent paths.

test(url_info, url_table_record)[source]
class wpull.urlfilter.RecursiveFilter(enabled=False, page_requisites=False)[source]

Bases: wpull.urlfilter.BaseURLFilter

Return True if recursion is used.

test(url_info, url_table_record)[source]
class wpull.urlfilter.RegexFilter(accepted=None, rejected=None)[source]

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that match a regular expression.

test(url_info, url_table_record)[source]
class wpull.urlfilter.SchemeFilter(allowed=('http', 'https', 'ftp'))[source]

Bases: wpull.urlfilter.BaseURLFilter

Allow a URL if its scheme is in the allowed list.

test(url_info, url_table_record)[source]
class wpull.urlfilter.SpanHostsFilter(hostnames, enabled=False, page_requisites=False, linked_pages=False)[source]

Bases: wpull.urlfilter.BaseURLFilter

Filter URLs that go to other hostnames.

test(url_info, url_table_record)[source]
class wpull.urlfilter.TriesFilter(max_tries)[source]

Bases: wpull.urlfilter.BaseURLFilter

Allow URLs that have been attempted up to a limit of tries.

test(url_info, url_table_record)[source]

urlrewrite Module

URL rewriting.

class wpull.urlrewrite.URLRewriter(hash_fragment: bool=False, session_id: bool=False)[source]

Bases: object

Clean up URLs.

rewrite(url_info: wpull.url.URLInfo) → wpull.url.URLInfo[source]

Rewrite the given URL.

wpull.urlrewrite.strip_path_session_id(path)[source]

Strip session ID from URL path.

wpull.urlrewrite.strip_query_session_id(query)[source]

util Module

Miscellaneous functions.

class wpull.util.ASCIIStreamWriter(stream, errors='backslashreplace')[source]

Bases: codecs.StreamWriter

A Stream Writer that encodes everything to ASCII.

By default, the replacement character is a Python backslash sequence.

DEFAULT_ERROR = 'backslashreplace'
decode(instance, errors='backslashreplace')[source]
encode(instance, errors='backslashreplace')[source]
write(instance)[source]
writelines(list_instance)[source]
class wpull.util.GzipPickleStream(filename=None, file=None, mode='rb', **kwargs)[source]

Bases: wpull.util.PickleStream

gzip compressed pickle stream.

close()[source]
class wpull.util.PickleStream(filename=None, file=None, mode='rb', protocol=3)[source]

Bases: object

Pickle stream helper.

close()[source]

Close stream.

dump(obj)[source]

Pickle an object.

iter_load()[source]

Unpickle objects.

load()[source]

Unpickle an object.
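
A round-trip sketch (the 'wb' and 'rb' mode strings are assumptions following the usual file mode conventions):

from wpull.util import GzipPickleStream

stream = GzipPickleStream('items.pickle.gz', mode='wb')
stream.dump({'url': 'http://example.com/'})
stream.dump({'url': 'http://example.net/'})
stream.close()

stream = GzipPickleStream('items.pickle.gz', mode='rb')
for item in stream.iter_load():
    print(item['url'])
stream.close()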

wpull.util.close_on_error(close_func)[source]

Context manager to close object on error.

wpull.util.datetime_str()[source]

Return the current time in simple ISO8601 notation.

wpull.util.filter_pem(data)[source]

Processes the bytes for PEM certificates.

Returns: set containing each certificate
wpull.util.get_exception_message(instance)[source]

Try to get the exception message or the class name.

wpull.util.get_package_data(filename, mode='rb')[source]

Return the contents of a real file or a zip file.

wpull.util.get_package_filename(filename, package_dir=None)[source]

Return the filename of the data file.

wpull.util.grouper(iterable, n, fillvalue=None)[source]

Collect data into fixed-length chunks or blocks.

wpull.util.is_ascii(text)[source]

Returns whether the given string is ASCII.

wpull.util.parse_iso8601_str(string)[source]

Parse a fixed ISO8601 datetime string.

Note

This function only parses dates in the format %Y-%m-%dT%H:%M:%SZ. You must use a library like dateutil to properly parse arbitrary dates and times.

Returns: A UNIX timestamp.
Return type: float
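
For example (the exact float representation is assumed):

>>> from wpull.util import parse_iso8601_str
>>> parse_iso8601_str('2016-06-21T12:00:00Z')
1466510400.0
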
wpull.util.peek_file(file, length=4096)[source]

Peek the file by calling read on it.

wpull.util.python_version()[source]

Return the Python version as a string.

wpull.util.reset_file_offset(file)[source]

Reset the file offset back to original position.

wpull.util.rewrap_bytes(data)[source]

Rewrap characters to 70 character width.

Intended to rewrap base64 content.

wpull.util.seek_file_end(file)[source]

Seek to the end of the file.

wpull.util.truncate_file(path)[source]

Truncate the file.

version Module

Version information.

wpull.version.__version__

A string conforming to the Semantic Versioning guidelines.

wpull.version.version_info

A tuple in the same format as sys.version_info.

wpull.version.get_version_tuple(string)[source]

Return a version tuple from a string.

waiter Module

Delays between requests.

class wpull.waiter.LinearWaiter(wait=0.0, random_wait=False, max_wait=10.0)[source]

Bases: wpull.waiter.Waiter

A linear back-off waiter.

Parameters:
  • wait – The normal delay time
  • random_wait – If True, randomly perturb the delay time by a factor between 0.5 and 1.5
  • max_wait – The maximum delay time

This waiter will increment by values of 1 second.

get()[source]
increment()[source]
reset()[source]
class wpull.waiter.Waiter[source]

Bases: object

Base class for Waiters.

Waiters are counters that indicate the delay between requests.

get()[source]

Return the time in seconds.

increment()[source]

Increment the delay possibly due to an error.

reset()[source]

Reset the delay back to normal setting.
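
For example, a LinearWaiter with a one-second base delay backs off by one second per error and returns to the base delay on reset (values illustrative):

from wpull.waiter import LinearWaiter

waiter = LinearWaiter(wait=1.0, max_wait=10.0)
print(waiter.get())  # 1.0: the normal delay
waiter.increment()   # back off after an error
print(waiter.get())  # 2.0
waiter.reset()       # a success restores the normal delay
print(waiter.get())  # 1.0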

warc Module

warc.format Module

WARC format.

For the WARC file specification, see http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.

For the CDX specifications, see https://archive.org/web/researcher/cdx_file_format.php and https://github.com/internetarchive/CDX-Writer.

class wpull.warc.format.WARCRecord[source]

Bases: object

A record in a WARC file.

fields

An instance of namevalue.NameValueRecord.

block_file

A file object. May be None.

CONTENT_TYPE = 'Content-Type'
NAME_OVERRIDES = frozenset({'Content-Length', 'WARC-Date', 'Content-Type', 'WARC-Warcinfo-ID', 'WARC-Segment-Origin-ID', 'WARC-Segment-Number', 'WARC-Block-Digest', 'WARC-Identified-Payload-Type', 'WARC-Refers-To', 'WARC-Target-URI', 'WARC-Type', 'WARC-Profile', 'WARC-Segment-Total-Length', 'WARC-Payload-Digest', 'WARC-Truncated', 'WARC-Record-ID', 'WARC-Concurrent-To', 'WARC-IP-Address', 'WARC-Filename'})

Field name case normalization overrides because hanzo’s warc-tools do not adequately conform to specifications.

REQUEST = 'request'
RESPONSE = 'response'
REVISIT = 'revisit'
SAME_PAYLOAD_DIGEST_URI = 'http://netpreserve.org/warc/1.0/revisit/identical-payload-digest'
TYPE_REQUEST = 'application/http;msgtype=request'
TYPE_RESPONSE = 'application/http;msgtype=response'
VERSION = 'WARC/1.0'
WARCINFO = 'warcinfo'
WARC_DATE = 'WARC-Date'
WARC_FIELDS = 'application/warc-fields'
WARC_RECORD_ID = 'WARC-Record-ID'
WARC_TYPE = 'WARC-Type'
compute_checksum(payload_offset: typing.Union=None)[source]

Compute and add the checksum data to the record fields.

This function also sets the content length.

get_http_header() → wpull.protocol.http.request.Response[source]

Return the HTTP header.

It only attempts to read the first 4 KiB of the payload.

Returns: An instance of http.request.Response or None.
Return type: Response, None
set_common_fields(warc_type: str, content_type: str)[source]

Set the required fields for the record.

set_content_length()[source]

Find and set the content length.

See also

compute_checksum().

wpull.warc.format.read_cdx(file, encoding='utf8')[source]

Iterate CDX file.

Parameters:
  • file – A file object.
  • encoding (str) – The encoding of the file.
Returns: Each item is a dict that maps from field key to value.
Return type: iterator
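
A reading sketch (opening the file in binary mode and the exact field keys present are assumptions that depend on how the CDX was written):

from wpull.warc.format import read_cdx

with open('mycrawl.cdx', 'rb') as cdx_file:
    for record in read_cdx(cdx_file):
        # Each record is a dict mapping CDX field keys to values.
        print(record)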

warc.recorder Module

class wpull.warc.recorder.BaseWARCRecorderSession(recorder, temp_dir=None, url_table=None)[source]

Bases: object

Base WARC recorder session.

close()[source]
class wpull.warc.recorder.FTPWARCRecorderSession(*args, **kwargs)[source]

Bases: wpull.warc.recorder.BaseWARCRecorderSession

FTP WARC Recorder Session.

begin_control(request: wpull.protocol.ftp.request.Request, connection_reused: bool=False)[source]
begin_transfer(response: wpull.protocol.ftp.request.Response)[source]
close(error=None)[source]
control_receive_data(data)[source]
control_send_data(data)[source]
end_control(response: wpull.protocol.ftp.request.Response, connection_closed=False)[source]
end_transfer(response: wpull.protocol.ftp.request.Response)[source]
transfer_receive_data(data: bytes)[source]
class wpull.warc.recorder.HTTPWARCRecorderSession(*args, **kwargs)[source]

Bases: wpull.warc.recorder.BaseWARCRecorderSession

HTTP WARC Recorder Session.

begin_request(request: wpull.protocol.http.request.Request)[source]
begin_response(response: wpull.protocol.http.request.Response)[source]
close()[source]
end_request(request: wpull.protocol.http.request.Request)[source]
end_response(response: wpull.protocol.http.request.Response)[source]
request_data(data: bytes)[source]
response_data(data: bytes)[source]
class wpull.warc.recorder.WARCRecorder(filename, params=None)[source]

Bases: object

Record to WARC file.

Parameters:
  • filename (str) – The filename (without the extension).
  • params (WARCRecorderParams) – Parameters.
CDX_DELIMINATOR = ' '

Default CDX delimiter.

DEFAULT_SOFTWARE_STRING = 'Wpull/2.0.1 Python/3.4.3'

Default software string.

close()[source]

Close the WARC file and clean up any logging handlers.

flush_session()[source]
listen_to_ftp_client(client: wpull.protocol.ftp.client.Client)[source]
listen_to_http_client(client: wpull.protocol.http.client.Client)[source]
new_ftp_recorder_session() → 'FTPWARCRecorderSession'[source]
new_http_recorder_session() → 'HTTPWARCRecorderSession'[source]
classmethod parse_mimetype(value)[source]

Return the MIME type from a Content-Type string.

Returns: A string in the form type/subtype or None.
Return type: str, None
set_length_and_maybe_checksums(record, payload_offset=None)[source]

Set the content length and possibly the checksums.

write_record(record)[source]

Append the record to the WARC file.

wpull.warc.recorder.WARCRecorderParams

WARCRecorder parameters.

Parameters:
  • compress (bool) – If True, files will be compressed with gzip
  • extra_fields (list) – A list of key-value pairs containing extra metadata fields
  • temp_dir (str) – Directory to use for temporary files
  • log (bool) – Include the program logging messages in the WARC file
  • appending (bool) – If True, the file is not overwritten upon opening
  • digests (bool) – If True, the SHA1 hash digests will be written.
  • cdx (bool) – If True, a CDX file will be written.
  • max_size (int) – If provided, output files are named like name-00000.ext and the log file will be in name-meta.ext.
  • move_to (str) – If provided, completed WARC files and CDX files will be moved to the given directory
  • url_table (database.URLTable) – If given, then revisit records will be written.
  • software_string (str) – The value for the software field in the Warcinfo record.

alias of WARCRecorderParamsType
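
A construction sketch (only a few parameters are set; defaults are assumed for the rest, and the output filename in the comment is illustrative):

from wpull.warc.recorder import WARCRecorder, WARCRecorderParams

params = WARCRecorderParams(compress=True, cdx=True)
recorder = WARCRecorder('mycrawl', params=params)  # e.g. writes mycrawl.warc.gz
# ... attach via listen_to_http_client() / listen_to_ftp_client() ...
recorder.close()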

writer Module

Document writers.

class wpull.writer.AntiClobberFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]

Bases: wpull.writer.BaseFileWriter

File writer that downloads to a new filename if the original exists.

session_class
class wpull.writer.AntiClobberFileWriterSession(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]

Bases: wpull.writer.BaseFileWriterSession

class wpull.writer.BaseFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]

Bases: wpull.writer.BaseWriter

Base class for saving documents to disk.

Parameters:
  • path_namer – The path namer.
  • file_continuing – If True, the writer will modify requests to fetch the remaining portion of the file
  • headers_included – If True, the writer will include the HTTP header responses on top of the document
  • local_timestamping – If True, the writer will set the Last-Modified timestamp on downloaded files
  • adjust_extension – If True, an HTML or CSS file extension will be added whenever the document is detected as such.
  • content_disposition – If True, the filename is extracted from the Content-Disposition header.
  • trust_server_names – If True and there is redirection, use the last given response for the filename.
session() → wpull.writer.BaseFileWriterSession[source]

Return the File Writer Session.

session_class

Return the class of File Writer Session.

This should be overridden by subclasses.

class wpull.writer.BaseFileWriterSession(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]

Bases: wpull.writer.BaseWriterSession

Base class for File Writer Sessions.

discard_document(response: wpull.protocol.abstract.request.BaseResponse)[source]
extra_resource_path(suffix: str) → str[source]
classmethod open_file(filename: str, response: wpull.protocol.abstract.request.BaseResponse, mode='wb+')[source]

Open a file object on to the Response Body.

Parameters:
  • filename – The path where the file is to be saved
  • response – Response
  • mode – The file mode

This function will create the directories if they do not exist.

process_request(request: wpull.protocol.abstract.request.BaseRequest)[source]
process_response(response: wpull.protocol.abstract.request.BaseResponse)[source]
save_document(response: wpull.protocol.abstract.request.BaseResponse)[source]
classmethod save_headers(filename: str, response: wpull.protocol.http.request.Response)[source]

Prepend the HTTP response header to the file.

Parameters:
  • filename – The path of the file
  • response – Response
classmethod set_timestamp(filename: str, response: wpull.protocol.http.request.Response)[source]

Set the Last-Modified timestamp onto the given file.

Parameters:
  • filename – The path of the file
  • response – Response
class wpull.writer.BaseWriter[source]

Bases: object

Base class for document writers.

session() → 'BaseWriterSession'[source]

Return a session for a document.

class wpull.writer.BaseWriterSession[source]

Bases: object

Base class for a single document to be written.

discard_document(response: wpull.protocol.abstract.request.BaseResponse)[source]

Don’t save the document.

This function is called by a Processor once the Processor deemed the document should be deleted (e.g., a “404 Not Found” response).

extra_resource_path(suffix: str) → typing.Union[source]

Return a filename suitable for saving extra resources.

process_request(request: wpull.protocol.abstract.request.BaseRequest) → wpull.protocol.abstract.request.BaseRequest[source]

Rewrite the request if needed.

This function is called by a Processor after it has created the Request, but before submitting it to a Client.

Returns:The original Request or a modified Request
process_response(response: wpull.protocol.abstract.request.BaseResponse)[source]

Do any processing using the given response if needed.

This function is called by a Processor before any response or error handling is done.

save_document(response: wpull.protocol.abstract.request.BaseResponse) → str[source]

Process and save the document.

This function is called by a Processor once the Processor deemed the document should be saved (e.g., a “200 OK” response).

Returns:The filename of the document.
class wpull.writer.IgnoreFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]

Bases: wpull.writer.BaseFileWriter

File writer that ignores files that already exist.

session_class
class wpull.writer.IgnoreFileWriterSession(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]

Bases: wpull.writer.BaseFileWriterSession

process_request(request)[source]
class wpull.writer.MuxBody(stream: typing.BinaryIO, **kwargs)[source]

Bases: wpull.body.Body

Writes data into a second file.

close()[source]
flush()[source]
write(data: bytes) → int[source]
writelines(lines)[source]
class wpull.writer.NullWriter[source]

Bases: wpull.writer.BaseWriter

File writer that doesn’t write files.

session() → wpull.writer.NullWriterSession[source]
class wpull.writer.NullWriterSession[source]

Bases: wpull.writer.BaseWriterSession

discard_document(response)[source]
extra_resource_path(suffix)[source]
process_request(request)[source]
process_response(response)[source]
save_document(response)[source]
class wpull.writer.OverwriteFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]

Bases: wpull.writer.BaseFileWriter

File writer that overwrites files.

session_class
class wpull.writer.OverwriteFileWriterSession(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]

Bases: wpull.writer.BaseFileWriterSession

class wpull.writer.SingleDocumentWriter(stream: typing.BinaryIO, headers_included: bool=False)[source]

Bases: wpull.writer.BaseWriter

Writer that writes all the data into a single file.

session() → wpull.writer.SingleDocumentWriterSession[source]
class wpull.writer.SingleDocumentWriterSession(stream: typing.BinaryIO, headers_included: bool)[source]

Bases: wpull.writer.BaseWriterSession

Write all data into stream.

discard_document(response)[source]
extra_resource_path(suffix)[source]
process_request(request)[source]
process_response(response: wpull.protocol.abstract.request.BaseResponse)[source]
save_document(response)[source]
class wpull.writer.TimestampingFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]

Bases: wpull.writer.BaseFileWriter

File writer that only downloads newer files from the server.

session_class
class wpull.writer.TimestampingFileWriterSession(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]

Bases: wpull.writer.BaseFileWriterSession

process_request(request: wpull.protocol.abstract.request.BaseRequest)[source]

What’s New

Summary of notable changes.

Unreleased

2.0.1 (2016-06-21)

  • Fixed: KeyError crash when psutil was not installed.
  • Fixed: AttributeError proxy error when using PhantomJS due to the response body not being written to a file.

2.0 (2016-06-17)

  • Removed: Lua scripting support and its Python counterpart (--lua-script and --python-script).
  • Removed: Python 3.2 & 3.3 support.
  • Removed: PyPy support.
  • Changed: IP addresses are normalized to a standard notation to avoid fetching duplicates such as IPv4 addresses written in hexadecimal or long-hand IPv6 addresses.
  • Changed: Scripting is now done using plugin interface via --plugin-script.
  • Fixed: Support for Python 3.5.
  • Fixed: FTP unable to handle directory listing with date in MMM DD YYYY and filename containing YYYY-MM-DD text.
  • Fixed: Downloads through the proxy (such as PhantomJS) now show up in the database and can be controlled through scripting.
  • Fixed: NotFound error when converting links in CSS file that contain URLs that were not fetched.
  • Fixed: When resuming a forcefully interrupted crawl (e.g., a crash) using a database, URLs in progress were not restarted because they were not reset in the database when the program started up.

Backwards incompatibility

This release contains backwards incompatible changes to the database schema and scripting interface.

If you use --database, the database created by older versions of Wpull cannot be used in this version.

Scripting hook code will need to be rewritten to use the new API. See the new documentation for scripting for the new style of interfacing with Wpull.

Additionally, for scripts, the internal event loop has switched from Trollius to the built-in asyncio.

1.2.3 (2016-02-03)

  • Removed: cx_freeze build support.
  • Deprecated: Lua Scripting support will be removed in next release.
  • Deprecated: Python 3.2 & 3.3 support will be removed in the next release.
  • Deprecated: PyPy support will be removed in the next release.
  • Fixed: Error when logging in with FTP to servers that don’t need a password.
  • Fixed: ValueError when downloading URLs that contain unencoded unprintable characters like Zero Width Non-Joiner or Right to Left Mark.

1.2.2 (2015-10-21)

  • Fixed: --output-document file doesn’t contain content.
  • Fixed: OverflowError when URL contains invalid port number greater than 65535 or less than 0.
  • Fixed: AssertionError when saving IPv4-mapped IPv6 addresses to WARC files.
  • Fixed: AttributeError when running with installed Trollius 2.0.
  • Changed: The setup file no longer requires optional psutil.

1.2.1 (2015-05-15)

  • Fixed: OverflowError with URLs with large port numbers.
  • Fixed: TypeError when using standard input as input file (--input-file -).
  • Changed: using --youtube-dl respects --inet4-only and --no-check-certificate now.

1.2 (2015-04-24)

  • Fixed: Connecting to sites with IPv4 & IPv6 support resulted in errors when IPv6 was not supported by the local network. Connections now use Happy Eyeballs Algorithm for IPv4 & IPv6 dual-stack support.
  • Fixed: SQLAlchemy error with PyPy and SQLAlchemy 1.0.
  • Fixed: Input URLs are not fetched in order. Regression since 1.1.
  • Fixed: UnicodeEncodeError when fetching FTP files with non-ASCII filenames.
  • Fixed: Session cookies not loaded when using --load-cookies.
  • Fixed: --keep-session-cookies was always on.
  • Changed: FTP communication uses UTF-8 instead of Latin-1.
  • Changed: --prefer-family=none is now default.
  • Added: none as a choice to --prefer-family.
  • Added: --no-glob and FTP filename glob support.

1.1.1 (2015-04-13)

  • Changed: when using --youtube-dl and --warc-file, the JSON metadata file is now saved in the WARC file compatible with pywb.
  • Changed: logging and progress meter to say “unspecified” instead of “none” when no content length is provided by server to match Wget.

1.1 (2015-04-03)

  • Security: Updated certificate bundle.
  • Fixed: --regex-type to accept pcre instead of posix. Regular expressions always use Python’s regex library. Posix regex is not supported.
  • Fixed: when using --warc-max-size and --warc-append, it wrote to existing sequential WARC files unnecessarily.
  • Fixed: input URLs stored in memory instead of saved on disk. This issue was notable if there were many URLs provided by the --input-file option.
  • Changed: when using --warc-max-size and --warc-append, the next sequential WARC file is created to avoid appending to corrupt files.
  • Changed: WARC file writing to use journal files and refuse to start the program if any journals exist. This avoids corrupting files through naive use of --warc-append and allows for future automated recovery.
  • Added: Open Graph and Twitter Card element links extraction.

1.0 (2015-03-14)

  • Fixed: a --database path with a question mark (?) truncated the path, did not use an on-disk database, or caused a TypeError. The question mark is automatically replaced with an underscore.
  • Fixed: HTTP proxy support broken since version 0.1001.
  • Added: no_proxy environment variable support.
  • Added: --proxy-domains, --proxy-exclude-domains, --proxy-hostnames, --proxy-exclude-hostnames
  • Removed: --no-secure-proxy-tunnel.

0.1009 (2015-03-08)

  • Added: --preserve-permissions.
  • Fixed: exit code returned as 2 instead of 1 on generic errors.
  • Changed: exception tracebacks are printed only on generic errors.
  • Changed: temporary WARC log file is now compressed to save space.

Scripting Hook API:

  • Added: Version 3 API
  • Added: wait_time to version 3 which provides useful context including response or error infos.

0.1008 (2015-02-26)

  • Security: updated certificate bundle.
  • Fixed: TypeError crash on bad Meta Refresh HTML element.
  • Fixed: unable to fetch FTP files with spaces and other special characters.
  • Fixed: AssertionError fetching URLs with trailing dot not properly removed.
  • Added: --no-cache.
  • Added: --report-speed.
  • Added: --monitor-disk and --monitor-memory.

0.1007 (2015-02-19)

  • Fixed malformed URLs printed to logs without sanitization.
  • Fixed AttributeError crash on FTP servers that support MLSD.
  • Improved link recursion heuristics when extracting from JavaScript and HTML.
  • Added --retr-symlinks.
  • Added --session-timeout.

0.1006.1 (2015-02-09)

  • Security: Fixed Referer HTTP header field leaking from HTTPS to HTTP.
  • Fixed AttributeError in proxy when using PhantomJS and pre_response scripting hook.
  • Fixed early program end when server returns error fetching robots.txt.
  • Fixed uninteresting errors outputted if program is forcefully closed.
  • Fixed --referer option not applied to subsequent requests.

0.1006 (2015-02-01)

  • Fixed inability to fetch URLs with hostnames starting/ending with hyphen.
  • Fixed “Invalid file descriptor” error in proxy server.
  • Fixed FTP listing dates mistakenly parsed as future date within the same month.
  • Added --escaped-fragment option.
  • Added --strip-session-id option.
  • Added --no-skip-getaddrinfo option.
  • Added --limit-rate option.
  • Added --phantomjs-max-time option.
  • Added --youtube-dl option.
  • Added --plugin-script option.
  • Improved PhantomJS stability.

0.1005 (2015-01-15)

  • Security: SSLv2/SSLv3 is disabled for --secure-protocol=auto. Added --no-strong-crypto that re-enables them again if needed.
  • Fixed NameError with PhantomJS proxy on Python 3.2.
  • Fixed PhantomJS stop waiting for page load too early.
  • Fixed “Line too long” error and remove uninteresting page errors during PhantomJS.
  • Fixed --page-requisites exceeding --level.
  • Fixed --no-verbose not providing informative messages and behaving like --quiet.
  • Fixed infinite page requisite recursion when using --span-hosts-allow page-requisites.
  • Added --page-requisites-level. The default max recursion depth on page requisites is now 5.
  • Added --very-quiet.
  • --no-verbose is the default when --concurrent is 2 or greater.

Database Schema:

  • URL inline column is now an integer.

0.1004.2 (2015-01-03)

Hotfix release.

  • Fixed PhantomJS mode’s MITM proxy AttributeError on certificates.

0.1004.1 (2015-01-03)

  • Fixed TypeError crash on a bad cookie.
  • Fixed PhantomJS mode’s MITM proxy SSL certificates not installed.

0.1004 (2014-12-25)

  • Fixed FTP data connection reuse error.
  • Fixed maximum recursion depth exceeded on FTP downloads.
  • Fixed FTP file listing detecting dates too eagerly as ISO8601 format.
  • Fixed crash on FTP if file listing could not find a date in a line.
  • Fixed HTTP status code 204 “No Content” interpreted as an error.
  • Fixed “cert already in hash table” error when using both OS and Wpull’s certificates.
  • Improved PhantomJS stability. Timeout errors should be less frequent.
  • Added --adjust-extension.
  • Added --content-disposition.
  • Added --trust-server-names.

0.1003 (2014-12-11)

  • Fixed FTP fetch where code 125 was not recognized as valid.
  • Fixed FTP 12 o’clock AM/PM time logic.
  • Fixed URLs fetched as lowercase URLs when scheme and authority separator is not provided.
  • Added --database-uri option to specify a SQLAlchemy URI.
  • Added none as a choice to --progress.
  • Added --user/--password support.
  • Scripting:
    • Fixed missing response callback during redirects. Regression introduced in v0.1002.

0.1002 (2014-11-24)

  • Fixed control characters printed without escaping.
  • Fixed cookie size not limited correctly per domain name.
  • Fixed URL parsing incorrectly allowing spaces in hostnames.
  • Fixed --sitemaps option not respecting --no-parent.
  • Fixed “Content overrun” error on broken web servers. A warning is logged instead.
  • Fixed SSL verification error despite --no-check-certificate is specified.
  • Fixed crash on IPv6 URLs containing consecutive dots.
  • Fixed crash attempting to connect to IPv6 addresses.
  • Consecutive slashes in URL paths are now flattened.
  • Fixed crash when fetching IPv6 robots.txt file.
  • Added experimental FTP support.
  • Switched default HTML parser to html5lib.
  • Scripting:
    • Added handle_pre_response callback hook.
  • API:
    • Fixed ConnectionPool max_host_count argument not used.
    • Moved document scraping concerns from WebProcessorSession to ProcessingRule.
    • Renamed SSLVerficationError to SSLVerificationError.

0.1001.2 (2014-10-25)

  • Fixed ValueError crash on HTTP redirects with bad IPv6 URLs.
  • Fixed AssertionError on link extraction with non-absolute URLs in “codebase” attribute.
  • Fixed premature exit during an error fetching robots.txt.
  • Fixed executable filename problem in setup.py for cx_Freeze builds.

0.1001.1 (2014-10-09)

  • Fixed URLs with IPv6 addresses not including brackets when using them in host strings.
  • Fixed AssertionError crash where PhantomJS crashed.
  • Fixed database slowness over time.
  • Cookies are now synchronized and shared with PhantomJS.
  • Scripting:
    • Fixed mismatched queued_url and dequeued_url causing negative values in a counter. Issue was caused by requeued items in “error” status.

0.1001 (2014-09-16)

  • Fixed --warc-move option which had no effect.
  • Fixed JavaScript scraper to not accept URLs with backslashes.
  • Fixed CSS scraper to not accept URLs longer than 500 characters.
  • Fixed ValueError crash in Cache when two URLs are added sequentially at the same time due to bad LinkedList key comparison.
  • Fixed crash formatting text when sizes reach terabytes.
  • Fixed hang which may occur with lots of connection across many hostnames.
  • Support for HTTP/HTTPS proxies but no HTTPS tunnelling support. Wpull will refuse to start without the insecure override option. Note that if authentication and WARC files are enabled, the username and password are recorded in the WARC file.
  • Improved database performance.
  • Added --ignore-fatal-errors option.
  • Added --http-parser option. You can now use html5lib as the HTML parser.
  • Support for PyPy 2.3.1 running with Python 3.2 implementation.
  • Consistent URL parsing among various Python versions.
  • Added --link-extractors option.
  • Added --debug-manhole option.
  • API:
    • document and scraper were put into their own packages.
    • HTML parsing was put into document.htmlparse package.
    • url.URLInfo no longer supports normalizing URLs by percent decoding unreserved/safe characters.
  • Scripting:
    • Dropped support for Scripting API version 1.
  • Database schema:
    • Column url_encoding is removed from urls table.

0.1000 (2014-09-02)

  • Dropped support for Python 2. Please file an issue if this is a problem.
  • Fixed possible crash on empty content with deflate compression.
  • Fixed document encoding detection on documents larger than 4096 bytes where an encoded character may have been truncated.
  • Always percent-encode IRIs with UTF-8 to match de facto web browser implementation.
  • HTTP headers are consistently decoded as Latin-1.
  • Scripting API:
    • New queued_url and dequeued_url hooks contributed by mback2k.
  • API:
    • Switched to Trollius instead of Tornado. Please use Trollius 1.0.2 alpha or greater.
    • Most of the internals related to the HTTP protocol were rewritten and as a result, major components are not backwards compatible; lots of changes were made. If you happen to be using Wpull’s API, please pin your requirements to <0.1000 if you do not want to make a migration. Please file an issue if this is a problem.

0.36.4 (2014-08-07)

  • Fixes crash when --save-cookies is used with non-ASCII cookies. Cookies with non-ASCII values are discarded.
  • Fixed HTTP gzip compressed content not decompressed during chunked transfer of single bytes.
  • Tornado 4.0 support.
  • API:
    • Renamed: cookie.CookieLimitsPolicy to DeFactoCookiePolicy.

0.36.3 (2014-07-25)

  • Improved performance on --database option. SQLite now uses synchronous=NORMAL instead of FULL.

0.36.2 (2014-07-16)

  • Fixed requirements.txt to use Tornado version less than 4.0.

0.36.1 (2014-07-16)

  • Fixes bug where “FINISHED” message was not logged in WARC file meta log. Regression was introduced in version 0.35.

0.36 (2014-06-23)

  • Works around PhantomJSRPCTimedOut errors.
  • Adds --phantomjs-exe option.
  • Supports extracting links from HTML img srcset attribute.
  • API:
    • Builder.build() returns Application instead of Engine.
    • Callback hooks exit_status and finishing_statistics now registered on Application instead of Engine.
    • network module split into two modules bandwidth and dns.
    • Adds observer module.
    • phantomjs.PhantomJSRemote.page_event renamed to page_observer.

0.35 (2014-06-16)

  • Adds --warc-move option.
  • Scripting:
    • Default scripting version is now 2.
  • API:
    • Builder moved into new module builder
    • Adds Application class intended for different UI in the future.
    • Resolver families parameter renamed to family. It accepts values from the module socket or PREFER_IPv4/PREFER_IPv6.
    • Adds HookableMixin. This removes the use of messy subclassing for scripting hooks.

0.34.1 (2014-05-26)

  • Fixes crash when a URL is incorrectly formatted by Wpull. (The incorrect formatting is not fixed yet however.)

0.34 (2014-05-06)

  • Fixes file descriptor leak with --phantomjs and --delete-after.
  • Fixes case where robots.txt file was stuck in download loop if server was offline.
  • Fixes loading of cookies file from Wget. Cookie file header checks are disabled.
  • Removes unneeded --no-strong-robots (superseded by --no-strong-redirects).
  • Fixes --no-phantomjs-snapshot option not respected.
  • More link extraction on HTML pages with elements with onclick, onkeyX, onmouseX, and data- attributes.
  • Adds web-based debugging console with --debug-console-port.

0.33.2 (2014-04-29)

  • Fixes links not resolved correctly when document includes <base href="..."> element.
  • Different proxy URL rewriting for PhantomJS option.

0.33.1 (2014-04-26)

  • Fixes --bind_address option not working. The option was never functional since the first release.
  • Fixes AttributeError crash when --phantomjs and --X-script options were used. Thanks to yipdw for reporting.
  • Fixes --warc-tempdir to use the current directory by default.
  • Fixes bad formatting and crash on links with malformed IPv6 addresses.
  • Uses more rules for link extraction from JavaScript to reduce false positives.

0.33 (2014-04-21)

  • Fixes invalid XHTML documents not properly extracted for links.
  • Fixes crash on empty page.
  • Support for extracting links from JavaScript segments and files.
  • Doesn’t discard extracted links if document can only be parsed partially.
  • API:
    • Moves OrderedDefaultDict from util to collections.
    • Moves DeflateDecompressor, gzip_decompress from util to decompression.
    • Moves sleep, TimedOut, wait_future, AdjustableSemaphore from util to async.
    • Moves to_bytes, to_str, normalize_codec_name, detect_encoding, try_decoding, format_size, printable_bytes, coerce_str_to_ascii from util to string.
    • Removes extended module.
  • Scripting:
    • Adds new wait_time() callback hook function.

0.32.1 (2014-04-20)

  • Fixes XHTML documents not properly extracted for links.
  • If a server responds with content declared as Gzip, the content is checked to see if it starts with the Gzip magic number. This check avoids misreading text as Gzip streams.

0.32 (2014-04-17)

  • Fixes crash when HTML meta refresh URL is empty.
  • Fixes crash when decoding a document that is malformed later in the document. These invalid documents are not searched for links.
  • Reduces CPU usage when --debug logging is not enabled.
  • Better support for detecting and differentiating XHTML and XML documents.
  • Fixes converting XHTML documents where it did not write XHTML syntax.
  • RSS/Atom feed link, url, icon elements are searched for links.
  • API:
    • document.detect_response_encoding() default peek argument is lowered to reduce hanging.
    • document.BaseDocumentDetector is now a base class for document type detection.

0.31 (2014-04-14)

  • Fixes issue where an early </html> causes link discovery to be broken and converted documents missing elements.
  • Fixes --no-parent which did not behave like Wget. This issue was noticeable with options such as --span-hosts-allow linked-pages.
  • Fixes --level where page requisites were mistakenly not fetched if it exceeds recursion level.
  • Includes PhantomJS version string in WARC warcinfo record.
  • User-agent string no longer includes Mozilla reference.
  • Implements --force-html and --base.
  • Cookies now are limited to approximately 4 kilobytes and a maximum of 50 cookies per domain.
  • Document parsing is now streamed for better handling of large documents.
  • Scripting:
    • Ability to set a scripting API version.
    • Scripting API version 2: Adds record_info argument to handle_error and handle_response.
  • API:
    • WARCRecorder uses new parameter object WARCRecorderParams.
    • document, scraper, converter modules heavily modified to accommodate streaming readers. document.BaseDocumentReader.parse was removed and replaced with read_links.
    • version.version_info available.

0.30 (2014-04-06)

  • Fixes crash on SSL handshake if connection is broken.
  • DNS entries are periodically removed from cache instead of held for long times.
  • Experimental cx_freeze support.
  • PhantomJS:
    • Fixes proxy errors with requests containing a body.
    • Fixes proxy errors with occasional FileNotFoundError.
    • Adds timeouts to calls.
    • Viewport size is now 1200 × 1920.
    • Default --phantomjs-scroll is now 10.
    • Scrolls to top of page before taking snapshot.
  • API:
    • URL filters moved into urlfilter module.
    • Engine uses and exposes interface to AdjustableSemaphore for issue #93.

0.29 (2014-03-31)

  • Fixes SSLVerficationError mistakenly raised during connection errors.
  • --span-hosts no longer implicitly enabled on non-recursive downloads. This behavior is superseded by strong redirect logic. (Use --span-hosts-allow to guarantee fetching of page-requisites.)
  • Fixes URL query strings normalized with unnecessary percent-encoding escapes. Some servers do not handle percent-encoded URLs well.
  • Fixes crash handling directory paths that may contain a filename or a filename that is a directory. This crash occurred when URLs like /blog and /blog/ both exist. If a directory path contains a filename, that part of the directory path is suffixed with .d. If a filename is an existing directory, the filename is suffixed with .f.
  • Fixes crash when URL’s hostname contains characters that decompose to dots.
  • Fixes crash when HTML document declares encoding name unknown to Python.
  • Fixes stuck in loop if server returns errors on robots.txt.
  • Implements --warc-dedup.
  • Implements --ignore-length.
  • Implements --output-document.
  • Implements --http-compression.
  • Supports reading HTTP compression “deflate” encoding (both zlib and raw deflate).
  • Scripting:
    • Adds engine_run() callback.
    • Exposes the instance factory.
  • API:
    • connection: Connection arguments changed. Uses ConnectionParams as a parameter object. HostConnectionPool arguments also changed.
    • database: URLDBRecord renamed to URL. URLStrDBRecord renamed to URLString.
  • Schema change:
    • New visits table.

0.28 (2014-03-27)

  • Fixes crash when redirected to malformed URL.
  • Fixes --directory-prefix not being honored.
  • Fixes unnecessary high CPU usage when determining encoding of document.
  • Fixes crash (GeneratorExit exception) when exiting on Python 3.4.
  • Uses new internal socket connection stream system.
  • Updates bundled certificates (Tue Jan 28 09:38:07 2014).
  • PhantomJS:
    • Fixes things not appearing in WARC files. This regression was introduced in 0.26 where PhantomJS’s disk cache was enabled. It is now disabled again.
    • Fixes HTTPS proxy URL rewriting where relative URLs were not properly rewritten.
    • Fixes proxy URL rewriting not working for localhost.
    • Fixes unwanted Accept-Language header picked up from environment. The value has been overridden to *.
    • Fixes --header options left out in requests.
  • API:
    • New iostream module.
    • extended module is deprecated.

0.27 (2014-03-23)

  • Fixes URLs ignored (if any) on command line when --input-file is specified.
  • Fixes crash when redirected to a URL that is not HTTP.
  • Fixes crash if lxml does not recognize the document encoding name. Falls back to Latin1 if lxml does not support the encoding after massaging the encoding name.
  • Fixes crash on IPv6 addresses when using scripting or external API calls.
  • Fixes speed shown as “0.0 B/s” instead of “– B/s” when speed can not be calculated.
  • Implements --local-encoding, --remote-encoding, --no-iri.
  • Implements --https-only.
  • Prints bandwidth speed statistics when exiting.
  • PhantomJS:
    • Implements “smart scrolling” that avoids unnecessary scrolling.
    • Adds --no-phantomjs-smart-scroll
  • API:
    • WebProcessorSession._parse_url() renamed to WebProcessorSession.parse_url()

0.26 (2014-03-16)

  • Fixes crash when URLs like http://example.com] were encountered.
  • Implements --sitemaps.
  • Implements --max-filename-length.
  • Implements --span-hosts-allow (experimental, see issues #61, #66).
  • Query strings items like ?a&b are now preserved and no longer normalized to ?a=&b=.
  • API:
    • url.URLInfo.normalize() was removed since it was mainly used internally.
    • Added url.normalize() convenience function.
    • writer: safe_filename(), url_to_filename(), url_to_dir_path() were modified.

0.25 (2014-03-13)

  • Fixes link converter not operating on the correct files when .N files were written.
  • Fixes apparent hang when Wpull is almost finished on documents with many links.
    • Previously, Wpull added all URLs to the database, causing processing overhead in the database. Now, only requisite URLs are added to the database.
  • Implements --restrict-file-names.
  • Implements --quota.
  • Implements --warc-max-size. Like Wget, “max size” is not the maximum size of each WARC file but it is the threshold size to trigger a new file. Unlike Wget, request and response records are not split across WARC files.
  • Implements --content-on-error.
  • Supports recording scrolling actions in WARC file when PhantomJS is enabled.
  • Adds the wpull command to bin/.
  • Database schema change: filename column was added.
  • API:
    • converter.py: Converters no longer use PathNamer.
    • writer.py: sanitize_file_parts() was removed in favor of new safe_filename(). save_document() returns a filename.
    • WebProcessor now requires a root path to be specified.
    • WebProcessor initializer now takes “parameter objects”.
  • Install requires new dependency: namedlist.

0.24 (2014-03-09)

  • Fixes crash when document encoding could not be detected. Thanks to DopefishJustin for reporting.
  • Fixes non-index files incorrectly saved where an extra directory was added as part of their path.
  • URL path escaping is relaxed. This helps with servers that don’t handle percent-encoding correctly.
  • robots.txt now bypasses the filters. Use --no-strong-robots to disable this behavior.
  • Redirects implicitly span hosts. Use --no-strong-redirects to disable this behavior.
  • Scripting: should_fetch() info dict now contains reason as a key.

0.23.1 (2014-03-07)

  • Important: Fixes issue where URLs were downloaded repeatedly.

0.23 (2014-03-07)

  • Fixes incorrect logic in fetching robots.txt when it redirects to another URL.
  • Fixes port number not included in the HTTP Host header.
  • Fixes occasional RuntimeError when pressing CTRL+C.
  • Fixes fetching URL paths containing dot segments. They are now resolved appropriately.
  • Fixes ASCII progress bar not showing 100% when finished download occasionally.
  • Fixes crash and improves handling of unusual document encodings and settings.
  • Improves handling of links with newlines and whitespace intermixed.
  • Requires beautifulsoup4 as a dependency.
  • API:
    • util.detect_encoding() arguments modified to accept only a single fallback and to accept is_html.
    • document.get_encoding() accepts is_html and peek arguments.

0.22.5 (2014-03-05)

  • The ‘Refresh’ HTTP header is now scraped for URLs.
  • When an error occurs during writing WARC files, the WARC file is truncated back to the last good state before crashing.
  • Works around error “Reached maximum read buffer size” downloading on fast connections. Side effect is intensive CPU usage.

0.22.4 (2014-03-05)

  • Fixes occasional error on chunked transfer encoding. Thanks to ivan for reporting.
  • Fixes handling links with newlines found in HTML pages. Newlines are now stripped in links when scraping pages to better handle HTML soup.

0.22.3 (2014-03-02)

  • Fixes another case of AssertionError on url_item.is_processed when robots.txt was enabled.
  • Fixes crash if a malformed gzip response was received.
  • Fixes --span-hosts to be implicitly enabled (as with --no-robots) if --recursive is not supplied. This behavior unconditionally allows downloading a single file without specifying any options. It is what a user intuitively expects.

0.22.2 (2014-03-01)

  • Improves performance on database operations. CPU usage should be less intensive.

0.22.1 (2014-02-28)

  • Fixes handling of “204 No Content” responses.
  • Fixes AssertionError on url_item.is_processed when robots.txt was enabled.
  • Fixes PhantomJS page scrolling to be consistent.
  • Lengthens PhantomJS viewport to ensure lazy-load images are properly triggered.
  • Lengthens PhantomJS paper size to reduce excessive fragmentation of blocks.

0.22 (2014-02-27)

  • Implements --phantomjs-scroll and --phantomjs-wait.
  • Implements saving HTML and PDF snapshots (including inside WARC file). Disable with --no-phantomjs-snapshot.
  • API: Adds PhantomJSController.

0.21.1 (2014-02-27)

  • Fixes missing dependencies and files in setup.py.
  • For PhantomJS:
    • Fixes capturing HTTPS connections.
    • Fixes statistics counter.
    • Supports very basic scraping of HTML. See Usage section.

0.21 (2014-02-26)

  • Fixes Request factory not used. This resolves issues where the User Agent was not set.
  • Experimental PhantomJS support. It can be enabled with --phantomjs. See the Usage section in the documentation for more details.
  • API changes:
    • The http module was split up into smaller modules: http.client, http.connection, http.request, http.util.
    • ChunkedTransferStreamReader was added as a reusable abstraction.
    • The web module was moved to http.web.
    • Added proxy module.
    • Added phantomjs module.

0.20 (2014-02-22)

  • Implements --no-dns-cache, --accept, --reject.
  • Scripting: Fixes AttributeError crash on handle_error.
  • Another possible fix for issue #27.

0.19.2 (2014-02-18)

  • Fixes crash if a non-HTTP URL was found during download.
  • Lua scripting: Fixes booleans coming from Wpull being mistakenly converted to integers on Python 2.

0.19.1 (2014-02-14)

  • Fixes --timestamping functionality.
  • Fixes --timestamping not checking .orig files.
  • Fixes HTTP handling of responses which do not return content.

0.19 (2014-02-12)

  • Fixes files not actually being written.
  • Implements --convert-links and --backup-converted.
  • API: HTMLScraper functions were refactored to be class methods. ScrapedLink was renamed to LinkInfo.

0.18.1 (2014-02-11)

  • Fixes error when WARC but not CDX option is specified.
  • Fixes closing of the SQLite database to avoid leaving temporary database files.

0.18 (2014-02-11)

  • Implements --no-warc-digests, --warc-cdx.
  • Improvements on reducing CPU usage.
  • API: Engine and Processor interaction refactored to be asynchronous.
    • The Engine and Processor classes were modified significantly.
    • The Engine no longer is concerned with fetching requests.
    • Requests are handled within Processors. This will benefit future Processors by allowing them to make arbitrary requests during processing.
    • The RedirectTracker was moved to a new web module.
    • A RichClient is implemented. It handles robots.txt, cookies, and redirect concerns.
    • WARCRecord was moved into a new warc module.

0.17.3 (2014-02-07)

  • Fixes ca-bundle file missing during install.
  • Fixes AttributeError on retry_dns_error.

0.17.2 (2014-02-06)

  • Another attempt to possibly fix #27.
  • Implements cleaning inactive connections from the connection pool.

0.17.1 (2014-02-05)

  • Another attempt to possibly fix #27.
  • API: Refactored ConnectionPool. It now calls put on HostConnectionPool to avoid sharing a queue.

0.17 (2014-02-05)

  • Implements cookie support.
  • Fixes non-recursive downloads where robots.txt was checked unnecessarily.
  • Possibly fixes issue #27 where HTTP workers get stuck.

0.16.1 (2014-02-05)

  • Adds some documentation about stopping Wpull and a list of all options.
  • API: Builder now exposes Factory.
  • API: WebProcessorSession was refactored to not pass arguments through the initializer. It also now uses DemuxDocumentScraper and DemuxURLFilter.

0.16 (2014-02-04)

  • Implements all the SSL options: --certificate, --random-file, --egd-file, --secure-protocol.
  • Further improves database performance.

0.15.2 (2014-02-03)

  • Improves database performance by reducing CPU usage.

0.15.1 (2014-02-03)

  • Improves database performance by reducing disk reads.

0.15 (2014-02-02)

  • Fixes robots.txt being fetched for every request.
  • Scripts: Supports replace as part of get_urls().
  • Schema change: The database URL strings are normalized into a separate table. Using --database should now consume less disk space.

0.14.1 (2014-02-02)

  • NameValueRecord now supports a normalize_override argument to control how specific keys are cased instead of the default title-case.
  • Fixes the WARC file’s field names to match the casing used by hanzo’s warc-tools. warc-tools does not support the case-insensitivity required by section 4 of the WARC specification; the WARC files generated by Wpull remain conformant either way.

0.14 (2014-02-01)

  • Database change: SQLAlchemy is now used for the URL Table.
    • Scripts: url_info['inline'] now returns a boolean, not an integer.
  • Implements --post-data and --post-file.
  • Scripts can now return post_data and link_type as part of get_urls().

0.13 (2014-01-31)

  • Supports reading HTTP responses with gzip content type.

0.12 (2014-01-31)

  • No changes to program usage itself.
  • More documentation.
  • Major API changes due to refactoring:
    • http.Body moved to conversation.Body.
    • document.HTTPScraper, document.CSSScraper moved to the scraper module.
    • The conversation module now contains base classes for protocol elements.
    • processor.WebProcessorSession now uses keyword arguments.
    • engine.Engine requires a Statistics argument.

0.11 (2014-01-29)

  • Implements --progress which includes a progress bar indicator.
  • Bumps up the HTTP connection buffer size to support fast connections.

0.10.9 (2014-01-28)

  • Adds documentation. No program changes.

0.10.8 (2014-01-26)

  • Improves robustness against bad HTTP protocol messages.
  • Fixes various URL and IRI handling issues.
  • Fixes --input-file to work as expected.
  • Fixes command line arguments not working under Python 2.

0.10 (2014-01-23)

  • Improves handling of URLs and document encodings.
  • Implements --ascii-print.
  • Fixes Lua scripting conversion of Python to Lua object types.

0.9 (2014-01-21)

  • Adds basic SSL options.

0.8 (2014-01-21)

  • Supports Python and Lua scripting via --python-script and --lua-script.

0.7 (2014-01-18)

  • Fixes robots.txt support.

0.6 (2014-01-17)

  • Implements --warc-append, --concurrent.
  • --read-timeout default is 900 seconds.

0.5 (2014-01-17)

  • Implements --no-http-keepalive, --rotate-dns.
  • Adds basic support for HTTPS.

0.4 (2014-01-15)

  • Implements --continue, --no-clobber, --timestamping.

0.3.2 (2014-01-07)

  • Fixes database rows not being saved correctly.

0.3 (2014-01-07)

  • Implements --hostnames and --exclude-hostnames.

0.2 (2014-01-06)

  • Implements --header option.
  • Various 3to2 bug fixes.

0.1 (2014-01-05)

  • The first usable release.

WARC Specification

Additional de facto and custom extensions to the WARC standard.

Wpull follows the specifications in the latest draft of ISO 28500.

FTP

FTP recording follows the Heritrix specifications.

Control Conversation

The Control Conversation is recorded as

  • WARC-Type: metadata
  • Content-Type: text/x-ftp-control-conversation
  • WARC-Target-URI: a URL. For example, ftp://anonymous@example.com/treasure.txt
  • WARC-IP-Address: an IPv4 address with port or an IPv6 address with brackets and port. For example, 192.0.2.1:21 or [2001:db8::1]:21.

The resource is formatted as follows:

  • Events are prefixed with an ASCII asterisk and a space.
  • Requests are prefixed with an ASCII greater-than sign and a space.
  • Responses are prefixed with an ASCII less-than sign and a space.

The document encoding is UTF-8.

Changed in version 1.2a1: The document encoding previously used Latin-1.
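
For illustration, a minimal Python sketch that assembles a record body in the format described above; the events, commands, replies, and address are invented for this example:

# A sketch of a text/x-ftp-control-conversation body. All commands,
# replies, and the address below are invented for illustration.
lines = [
    '* Connecting to 192.0.2.10:21',
    '< 220 Service ready.',
    '> USER anonymous',
    '< 331 Please specify the password.',
    '> PASS guest',
    '< 230 Login successful.',
    '> RETR treasure.txt',
    '< 226 Transfer complete.',
]
body = '\n'.join(lines).encode('utf-8')  # the document encoding is UTF-8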

Response data

The response data is recorded as

  • WARC-Type: resource
  • WARC-Target-URI: a URL. For example, ftp://anonymous@example.com/treasure.txt
  • WARC-Concurrent-To: a WARC Record ID of the Control Conversation

PhantomJS

Snapshot

A PhantomJS Snapshot represents the state of the DOM at the time of capture.

A Snapshot is recorded as

  • WARC-Type: resource
  • WARC-Target-URI: urn:X-wpull:snapshot?url=URLHERE where URLHERE is a percent-encoded URL of the PhantomJS page (see the sketch after this list).
  • Content-Type: one of application/pdf, text/html, image/png
  • WARC-Concurrent-To: a WARC Record ID of a Snapshot Action Metadata record.
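
A minimal sketch of deriving this URI with Python’s standard library (illustrative only; not necessarily how Wpull computes it internally):

from urllib.parse import quote

def snapshot_uri(page_url):
    # Percent-encode the page URL into the query component of the
    # urn:X-wpull:snapshot URI. Illustrative sketch only.
    return 'urn:X-wpull:snapshot?url=' + quote(page_url, safe='')

print(snapshot_uri('https://example.com/page?a=1'))
# urn:X-wpull:snapshot?url=https%3A%2F%2Fexample.com%2Fpage%3Fa%3D1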

Snapshot Action Metadata

An Action Metadata record is a log of the steps performed before a Snapshot is taken.

It is recorded as

  • WARC-Type: metadata
  • Content-Type: application/json
  • WARC-Target-URI: urn:X-wpull:snapshot?url=URLHERE where URLHERE is a percent-encoded URL of the PhantomJS page.
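
The JSON schema of the action log is not specified here. As a purely hypothetical illustration, assembling such a record body might look like:

import json

# Hypothetical structure only; the real action log schema may differ.
actions = {
    'url': 'https://example.com/',
    'actions': [
        {'event': 'scroll', 'x': 0, 'y': 768},
        {'event': 'scroll', 'x': 0, 'y': 1536},
    ],
}
body = json.dumps(actions).encode('utf-8')  # Content-Type: application/json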

Wpull Metadata

Log

Wpull’s log is recorded as

  • WARC-Type: resource
  • Content-Type: text/plain
  • WARC-Target-URI: urn:X-wpull:log

The document encoding is UTF-8.
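
As a sketch, such a record could be written with the third-party warcio library (an assumption for illustration; Wpull does not use warcio itself, and the log line is invented):

from io import BytesIO

from warcio.warcwriter import WARCWriter

with open('example.warc.gz', 'wb') as output:
    writer = WARCWriter(output, gzip=True)
    payload = BytesIO('INFO Fetched a page.\n'.encode('utf-8'))
    record = writer.create_warc_record(
        'urn:X-wpull:log', 'resource',
        payload=payload, warc_content_type='text/plain')
    writer.write_record(record)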

youtube-dl

The JSON file is recorded as

  • WARC-Type: metadata
  • Content-Type: application/vnd.youtube-dl_formats+json
  • WARC-Target-URI: metadata://AUTHORITY_AND_RESOURCE where AUTHORITY_AND_RESOURCE is the hierarchical part, query, and fragment of the URL passed to youtube-dl. In other words, the URI is the URL with its scheme replaced by metadata, as shown in the sketch after this list.
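
A minimal sketch of this scheme replacement using Python’s standard library (the video URL is invented):

from urllib.parse import urlsplit, urlunsplit

def metadata_uri(video_url):
    # Replace the scheme of the URL passed to youtube-dl with 'metadata',
    # keeping the authority, path, query, and fragment intact.
    parts = urlsplit(video_url)
    return urlunsplit(('metadata',) + parts[1:])

print(metadata_uri('https://video.example.com/watch?v=abc123'))
# metadata://video.example.com/watch?v=abc123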
