Welcome to Wpull’s documentation!¶
Homepage: https://github.com/chfoo/wpull
Contents:
Introduction¶
Wpull is a Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler.
Notable Features:
- Written in Python: lightweight, modifiable, robust, & scriptable
- Graceful stopping; on-disk database resume
- PhantomJS & youtube-dl integration (experimental)
Wpull is designed to be an (almost) drop-in replacement for Wget, requiring minimal changes to options. It is intended for running large crawls rather than for quickly downloading a single file.
Wpull’s behavior is not an exact duplicate of Wget’s. As such, you should not expect identical output and operation from Wpull. However, it aims to be a very useful alternative, as its source code can be easily modified to fix, change, or extend its behavior.
For instructions, read on to the next sections. Confused? Check out the Frequently Asked Questions.
Installation¶
Requirements¶
Wpull requires the following:
- Python 3.4.3 or greater
- Tornado 4.0 or greater
- html5lib
- Or lxml for faster, but much less robust, HTML parsing
- chardet
- Or cchardet, a faster version of chardet
- SQLAlchemy 0.9 or greater
The following are optional:
- psutil for monitoring disk space
- Manhole for a REPL debugging socket
- PhantomJS 1.9.8, 2.1 for capturing interactive JavaScript pages
- youtube-dl for downloading complex video streaming sites
It is recommended to install Wpull with the pip installer.
Wpull is officially supported in a Unix-like environment.
Automatic Install¶
Once you have installed Python, lxml, and pip, install Wpull with dependencies automatically from PyPI:
pip3 install wpull
Tip
Adding the --upgrade option will upgrade Wpull to the latest release. Use --no-dependencies to only upgrade Wpull. Adding the --user option will install Wpull into your home directory.
Automatic install is usually the best option. However, there may be outstanding fixes to bugs that are not yet released to PyPI. In this case, use the manual install.
Manual Install¶
Install the dependencies known to work with Wpull:
pip3 install -r https://raw.githubusercontent.com/chfoo/wpull/master/requirements.txt
Install Wpull from GitHub:
pip3 install git+https://github.com/chfoo/wpull.git#egg=wpull
Tip
Using git+https://github.com/chfoo/wpull.git@develop#egg=wpull as the path will install Wpull’s develop branch.
psutil¶
psutil is required for the disk and memory monitoring options but may not be available. To install:
pip3 install psutil
Pre-built Binaries¶
Wpull has pre-built binaries located at https://launchpad.net/wpull/+download. These are unsupported and may not be up to date.
Caveats¶
Python¶
Please obtain the latest Python release from http://python.org/download/ or your package manager. Python 3.4.3 or greater is officially supported.
Python 2 and PyPy are not supported.
lxml¶
It is recommended that lxml is obtained through an installer or pre-built package. Windows packages are provided on https://pypi.python.org/pypi/lxml. Debian/Ubuntu users should install python3-lxml. For more information, see http://lxml.de/installation.html.
pip¶
If pip is not installed on your system yet, please follow the instructions at http://www.pip-installer.org/en/latest/installing.html to install pip. Note for Linux users: ensure you are executing the appropriate Python version when installing pip.
PhantomJS (Optional)¶
It is recommended to download a prebuilt binary from http://phantomjs.org/download.html.
Usage¶
Intro¶
Wpull is a command-line oriented program, much like Wget. It is non-interactive and requires all options to be specified on startup. If you are not familiar with Wget, please see the Wikipedia article on Wget.
Example Commands¶
To download the About page of Google.com:
wpull google.com/about
To archive a website:
wpull billy.blogsite.example \
--warc-file blogsite-billy \
--no-check-certificate \
    --no-robots --user-agent "InconspicuousWebBrowser/1.0" \
--wait 0.5 --random-wait --waitretry 600 \
--page-requisites --recursive --level inf \
--span-hosts-allow linked-pages,page-requisites \
--escaped-fragment --strip-session-id \
--sitemaps \
--reject-regex "/login\.php" \
--tries 3 --retry-connrefused --retry-dns-error \
--timeout 60 --session-timeout 21600 \
--delete-after --database blogsite-billy.db \
--quiet --output-file blogsite-billy.log
Wpull can also be invoked using:
python3 -m wpull
Stopping & Resuming¶
To gracefully stop Wpull, press CTRL+C (or send SIGINT). Wpull will quit once the current download has finished. To stop immediately, press CTRL+C again (or send SIGTERM).
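The two-stage stop described above is a common pattern for long-running downloaders. As a minimal sketch (a hypothetical stand-in, not Wpull’s actual implementation), a first SIGINT can set a flag that the download loop checks, while a second forces an immediate exit:

```python
import signal
import sys


class GracefulStopper:
    """First SIGINT requests a graceful stop; a second one exits immediately."""

    def __init__(self):
        self.stop_requested = False
        signal.signal(signal.SIGINT, self._handle_interrupt)

    def _handle_interrupt(self, signum, frame):
        if self.stop_requested:
            sys.exit(1)  # second CTRL+C: stop immediately
        self.stop_requested = True  # first CTRL+C: finish the current download

    def should_continue(self):
        # The download loop checks this between items.
        return not self.stop_requested
```

The names here are illustrative; the point is only that the current item is allowed to finish before the program quits.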
If you have used the --database option, Wpull can reuse the existing database for resuming crawls. This behavior is different from --continue. Resuming with --continue is intended for resuming partially downloaded files, while --database is intended for resuming partial crawls.
To resume a crawl, provided you have used --database, simply reuse the same command options from the previous run. This will maintain the same behavior as the previous run. You may also tweak the options, for example, to limit the recursion depth.
Note
When resuming downloads with --warc-file and --database, Wpull will overwrite the WARC file by default. This occurs because Wpull simply maintains a list of URLs that are fetched and not fetched. You should either rename the existing file manually, use the additional option --warc-append, or move the files with --warc-move.
Proxied Services¶
Wpull is able to use an HTTP proxy server to capture traffic from third-party programs such as PhantomJS.
The requests will go through the proxy to Wpull’s HTTP client (which can be recorded with --warc-file).
Warning
Wpull uses the HTTP proxy insecurely on localhost: another user on the same machine could send bogus requests to it. However, Wpull does not expose the proxy to the network by default.
It is not possible to use the proxy standalone at this time.
PhantomJS Integration¶
PhantomJS support is currently experimental.
--phantomjs will enable PhantomJS integration.
If an HTML document is encountered, Wpull will open the URL in PhantomJS. After the page is loaded, Wpull will try to scroll the page as specified by --phantomjs-scroll. Then, the HTML DOM source is scraped for URLs as normal. HTML and PDF snapshots are taken by default.
Currently, Wpull will not do anything else to manipulate the page such as clicking on links. As a consequence, Wpull with PhantomJS is not a complete solution for dynamic web pages yet!
Storing console logs and alert messages inside the WARC file is not yet supported.
youtube-dl Integration¶
youtube-dl support is currently experimental.
--youtube-dl will enable youtube-dl integration.
If an HTML document is encountered, Wpull will run youtube-dl on the URL. Wpull passes the options for downloading subtitles and thumbnails; other options are left at their defaults, which may not grab the best possible quality. For example, youtube-dl may not grab the highest-quality stream when it is not a simple video file.
Using recursion with this option is not recommended because it may fetch large amounts of redundant data.
Storing manifests, metadata, or converted files inside the WARC file is not yet supported.
Options¶
Wget-compatible web downloader and crawler.
usage: wpull [-h] [-V] [--plugin-script FILE] [--plugin-args PLUGIN_ARGS]
[--database FILE | --database-uri URI] [--concurrent N]
[--debug-console-port PORT] [--debug-manhole]
[--ignore-fatal-errors] [--monitor-disk MONITOR_DISK]
[--monitor-memory MONITOR_MEMORY] [-o FILE | -a FILE]
[-d | -v | -nv | -q | -qq] [--ascii-print]
[--report-speed TYPE={bits}] [-i FILE] [-F] [-B URL]
[--http-proxy HTTP_PROXY] [--https-proxy HTTPS_PROXY]
[--proxy-user USER] [--proxy-password PASS] [--no-proxy]
[--proxy-domains LIST] [--proxy-exclude-domains LIST]
[--proxy-hostnames LIST] [--proxy-exclude-hostnames LIST]
[-t NUMBER] [--retry-connrefused] [--retry-dns-error] [-O FILE]
[-nc] [-c] [--progress TYPE={bar,dot,none}] [-N]
[--no-use-server-timestamps] [-S] [-T SECONDS]
[--dns-timeout SECS] [--connect-timeout SECS]
[--read-timeout SECS] [--session-timeout SECS] [-w SECONDS]
[--waitretry SECONDS] [--random-wait] [-Q NUMBER]
[--bind-address ADDRESS] [--limit-rate RATE] [--no-dns-cache]
[--rotate-dns] [--no-skip-getaddrinfo]
[--restrict-file-names MODES=<ascii,lower,nocontrol,unix,upper,windows>]
[-4 | -6 | --prefer-family FAMILY={IPv4,IPv6,none}] [--user USER]
[--password PASSWORD] [--no-iri] [--local-encoding ENC]
[--remote-encoding ENC] [--max-filename-length NUMBER] [-nd | -x]
[-nH] [--protocol-directories] [-P PREFIX] [--cut-dirs NUMBER]
[--http-user HTTP_USER] [--http-password HTTP_PASSWORD]
[--no-cache] [--default-page NAME] [-E] [--ignore-length]
[--header STRING] [--max-redirect NUMBER] [--referer URL]
[--save-headers] [-U AGENT] [--no-robots] [--no-http-keep-alive]
[--no-cookies] [--load-cookies FILE] [--save-cookies FILE]
[--keep-session-cookies] [--post-data STRING | --post-file FILE]
[--content-disposition] [--content-on-error] [--http-compression]
[--html-parser {html5lib,libxml2-lxml}]
[--link-extractors <css,html,javascript>] [--escaped-fragment]
[--strip-session-id]
[--secure-protocol PR={SSLv3,TLSv1,TLSv1.1,TLSv1.2,auto}]
[--https-only] [--no-check-certificate] [--no-strong-crypto]
[--certificate FILE] [--certificate-type TYPE={PEM}]
[--private-key FILE] [--private-key-type TYPE={PEM}]
[--ca-certificate FILE] [--ca-directory DIR]
[--no-use-internal-ca-certs] [--random-file FILE]
[--edg-file FILE] [--ftp-user USER] [--ftp-password PASS]
[--no-remove-listing] [--no-glob] [--preserve-permissions]
[--retr-symlinks [{0,1,no,off,on,yes}]] [--warc-file FILENAME]
[--warc-append] [--warc-header STRING] [--warc-max-size NUMBER]
[--warc-move DIRECTORY] [--warc-cdx] [--warc-dedup FILE]
[--no-warc-compression] [--no-warc-digests] [--no-warc-keep-log]
[--warc-tempdir DIRECTORY] [-r] [-l NUMBER] [--delete-after] [-k]
[-K] [-p] [--page-requisites-level NUMBER] [--sitemaps] [-A LIST]
[-R LIST] [--accept-regex REGEX] [--reject-regex REGEX]
[--regex-type TYPE={pcre}] [-D LIST] [--exclude-domains LIST]
[--hostnames LIST] [--exclude-hostnames LIST] [--follow-ftp]
[--follow-tags LIST] [--ignore-tags LIST]
[-H | --span-hosts-allow LIST=<linked-pages,page-requisites>]
[-L] [-I LIST] [--trust-server-names] [-X LIST] [-np]
[--no-strong-redirects] [--proxy-server]
[--proxy-server-address ADDRESS] [--proxy-server-port PORT]
[--phantomjs] [--phantomjs-exe PATH]
[--phantomjs-max-time PHANTOMJS_MAX_TIME]
[--phantomjs-scroll NUM] [--phantomjs-wait SEC]
[--no-phantomjs-snapshot] [--no-phantomjs-smart-scroll]
[--youtube-dl] [--youtube-dl-exe PATH]
[URL [URL ...]]
Positional arguments:
- URL: the URL to be downloaded

Options:
- -V, --version: show program’s version number and exit
- --plugin-script: load plugin script from FILE
- --plugin-args: arguments for the plugin
- --database: save database tables into FILE instead of memory
- --database-uri: save database tables at SQLAlchemy URI instead of memory
- --concurrent: run at most N downloads at the same time
- --debug-console-port: run a web debug console at given port number
- --debug-manhole: install Manhole debugging socket
- --ignore-fatal-errors: ignore all internal fatal exception errors
- --monitor-disk: pause if minimum free disk space is exceeded
- --monitor-memory: pause if minimum free memory is exceeded
- -o, --output-file: write program messages to FILE
- -a, --append-output: append program messages to FILE
- -d, --debug: print debugging messages
- -v, --verbose: print informative program messages and detailed progress
- -nv, --no-verbose: print informative program messages and errors
- -q, --quiet: print program error messages
- -qq, --very-quiet: do not print program messages unless critical
- --ascii-print: print program messages in ASCII only
- --report-speed: print speed in bits only instead of human formatted units (choices: bits)
- -i, --input-file: download URLs listed in FILE
- -F, --force-html: read URL input files as HTML files
- -B, --base: resolves input relative URLs to URL
- --http-proxy: HTTP proxy for HTTP requests
- --https-proxy: HTTP proxy for HTTPS requests
- --proxy-user: username for proxy “basic” authentication
- --proxy-password: password for proxy “basic” authentication
- --no-proxy: disable proxy support
- --proxy-domains: use proxy only from LIST of hostname suffixes
- --proxy-exclude-domains: don’t use proxy only from LIST of hostname suffixes
- --proxy-hostnames: use proxy only from LIST of hostnames
- --proxy-exclude-hostnames: don’t use proxy only from LIST of hostnames
- -t, --tries: try NUMBER of times on transient errors
- --retry-connrefused: retry even if the server does not accept connections
- --retry-dns-error: retry even if DNS fails to resolve hostname
- -O, --output-document: stream every document into FILE
- -nc, --no-clobber: don’t use anti-clobbering filenames
- -c, --continue: resume downloading a partially-downloaded file
- --progress: choose the type of progress indicator (choices: dot, bar, none)
- -N, --timestamping: only download files that are newer than local files
- --no-use-server-timestamps: don’t set the last-modified time on files
- -S, --server-response: print the protocol responses from the server
- -T, --timeout: set DNS, connect, read timeout options to SECONDS
- --dns-timeout: timeout after SECS seconds for DNS requests
- --connect-timeout: timeout after SECS seconds for connection requests
- --read-timeout: timeout after SECS seconds for reading requests
- --session-timeout: timeout after SECS seconds for downloading files
- -w, --wait: wait SECONDS seconds between requests
- --waitretry: wait up to SECONDS seconds on retries
- --random-wait: randomly perturb the time between requests
- -Q, --quota: stop after downloading NUMBER bytes
- --bind-address: bind to ADDRESS on the local host
- --limit-rate: limit download bandwidth to RATE
- --no-dns-cache: disable caching of DNS lookups
- --rotate-dns: use different resolved IP addresses on requests
- --no-skip-getaddrinfo: always use the OS’s name resolver interface
- --restrict-file-names: list of safe filename modes to use (choices: unix, nocontrol, ascii, windows, lower, upper)
- -4, --inet4-only: connect to IPv4 addresses only
- -6, --inet6-only: connect to IPv6 addresses only
- --prefer-family: prefer to connect to FAMILY IP addresses (choices: none, IPv6, IPv4)
- --user: username for both FTP and HTTP authentication
- --password: password for both FTP and HTTP authentication
- --no-iri: use ASCII encoding only
- --local-encoding: use ENC as the encoding of input files and options
- --remote-encoding: force decoding documents using codec ENC
- --max-filename-length: limit filename length to NUMBER characters
- -nd, --no-directories: don’t create directories
- -x, --force-directories: always create directories
- -nH, --no-host-directories: don’t create directories for hostnames
- --protocol-directories: create directories for URL schemes
- -P, --directory-prefix: save everything under the directory PREFIX
- --cut-dirs: don’t make NUMBER of leading directories
- --http-user: username for HTTP authentication
- --http-password: password for HTTP authentication
- --no-cache: request server to not use cached version of files
- --default-page: use NAME as index page if not known
- -E, --adjust-extension: append HTML or CSS file extension if needed
- --ignore-length: ignore any Content-Length provided by the server
- --header: adds STRING to the HTTP header
- --max-redirect: follow only up to NUMBER document redirects
- --referer: always use URL as the referrer
- --save-headers: include server header responses in files
- -U, --user-agent: use AGENT instead of Wpull’s user agent
- --no-robots: ignore robots.txt directives
- --no-http-keep-alive: disable persistent HTTP connections
- --no-cookies: disables HTTP cookie support
- --load-cookies: load Mozilla cookies.txt from FILE
- --save-cookies: save Mozilla cookies.txt to FILE
- --keep-session-cookies: include session cookies when saving cookies to file
- --post-data: use POST for all requests with query STRING
- --post-file: use POST for all requests with query in FILE
- --content-disposition: use filename given in Content-Disposition header
- --content-on-error: keep error pages
- --http-compression: request servers to use HTTP compression
- --html-parser: select HTML parsing library and strategy (choices: libxml2-lxml, html5lib)
- --link-extractors: specify which link extractors to use (choices: css, html, javascript)
- --escaped-fragment: rewrite links with hash fragments to escaped fragments
- --strip-session-id: remove session ID tokens from links
- --secure-protocol: specify the version of the SSL protocol to use (choices: SSLv3, TLSv1, TLSv1.1, TLSv1.2, auto)
- --https-only: download only HTTPS URLs
- --no-check-certificate: don’t validate SSL server certificates
- --no-strong-crypto: don’t use secure protocols/ciphers
- --certificate: use FILE containing the local client certificate
- --certificate-type: undocumented (choices: PEM)
- --private-key: use FILE containing the local client private key
- --private-key-type: undocumented (choices: PEM)
- --ca-certificate: load and use CA certificate bundle from FILE
- --ca-directory: load and use CA certificates from DIR
- --no-use-internal-ca-certs: don’t use CA certificates included with Wpull
- --random-file: use data from FILE to seed the SSL PRNG
- --edg-file: connect to entropy gathering daemon using socket FILE
- --ftp-user: username for FTP login
- --ftp-password: password for FTP login
- --no-remove-listing: keep directory file listings
- --no-glob: don’t use filename glob patterns on FTP URLs
- --preserve-permissions: apply server’s Unix file permissions on downloaded files
- --retr-symlinks: if disabled, preserve symlinks and run with security risks (choices: yes, on, 1, off, no, 0)
- --warc-file: save WARC file to filename prefixed with FILENAME
- --warc-append: append instead of overwrite the output WARC file
- --warc-header: include STRING in WARC file metadata
- --warc-max-size: write sequential WARC files sized about NUMBER bytes
- --warc-move: move WARC files to DIRECTORY as they complete
- --warc-cdx: write CDX file along with the WARC file
- --warc-dedup: write revisit records using digests in FILE
- --no-warc-compression: do not compress the WARC file
- --no-warc-digests: do not compute and save SHA1 hash digests
- --no-warc-keep-log: do not save a log into the WARC file
- --warc-tempdir: use temporary DIRECTORY for preparing WARC files
- -r, --recursive: follow links and download them
- -l, --level: limit recursion depth to NUMBER
- --delete-after: download files temporarily and delete them after
- -k, --convert-links: rewrite links in files that point to local files
- -K, --backup-converted: save original files before converting their links
- -p, --page-requisites: download objects embedded in pages
- --page-requisites-level: limit page-requisites recursion depth to NUMBER
- --sitemaps: download Sitemaps to discover more links
- -A, --accept: download only files with suffix in LIST
- -R, --reject: don’t download files with suffix in LIST
- --accept-regex: download only URLs matching REGEX
- --reject-regex: don’t download URLs matching REGEX
- --regex-type: use regex TYPE (choices: pcre)
- -D, --domains: download only from LIST of hostname suffixes
- --exclude-domains: don’t download from LIST of hostname suffixes
- --hostnames: download only from LIST of hostnames
- --exclude-hostnames: don’t download from LIST of hostnames
- --follow-ftp: follow links to FTP sites
- --follow-tags: follow only links contained in LIST of HTML tags
- --ignore-tags: don’t follow links contained in LIST of HTML tags
- -H, --span-hosts: follow links and page requisites to other hostnames
- --span-hosts-allow: selectively span hosts for resource types in LIST (choices: page-requisites, linked-pages)
- -L, --relative: follow only relative links
- -I, --include-directories: download only paths in LIST
- --trust-server-names: use the last given URL for filename during redirects
- -X, --exclude-directories: don’t download paths in LIST
- -np, --no-parent: don’t follow to parent directories on URL path
- --no-strong-redirects: don’t implicitly allow span hosts for redirects
- --proxy-server: run HTTP proxy server for capturing requests
- --proxy-server-address: bind the proxy server to ADDRESS
- --proxy-server-port: bind the proxy server port to PORT
- --phantomjs: use PhantomJS for loading dynamic pages
- --phantomjs-exe: path of PhantomJS executable
- --phantomjs-max-time: maximum duration of PhantomJS session
- --phantomjs-scroll: scroll the page up to NUM times
- --phantomjs-wait: wait SEC seconds between page interactions
- --no-phantomjs-snapshot: don’t take dynamic page snapshots
- --no-phantomjs-smart-scroll: always scroll the page to maximum scroll count option
- --youtube-dl: use youtube-dl for downloading videos
- --youtube-dl-exe: path of youtube-dl executable
Defaults may differ depending on the operating system. Use --help to see them.
This is only a programmatically generated listing from the program. In most cases, you can follow Wget’s documentation for these options. Wpull follows Wget’s behavior, so please check Wget’s online documentation and resources before asking questions.
Differences between Wpull and Wget¶
In most cases, Wpull can be substituted with Wget easily. However, some options may not be implemented yet. This section describes the reasons for option differences.
Missing in Wpull¶
- --background
- --execute
- --config
- --spider
- --ignore-case
- --ask-password
- --unlink
- --method
- --body-data
- --body-file
- --auth-no-challenge: Temporarily on by default, but specifying the option is not yet available. Digest authentication is not yet supported.
- --no-passive-ftp
- --mirror
- --strict-comments: No plans for support of this option.
- --regex-type=posix: No plans to support POSIX regex.
- Features newer than Wget 1.15.
Missing in Wget¶
- --plugin-script: Provides scripting hooks.
- --plugin-args
- --database: Enables the use of the on-disk database.
- --database-uri
- --concurrent: Allows changing the number of downloads that happen at once.
- --debug-console-port
- --debug-manhole
- --ignore-fatal-errors
- --monitor-disk: Avoids filling the disk.
- --monitor-memory
- --very-quiet
- --ascii-print: Force-replaces Unicode text with escaped values for environments that are ASCII only.
- --http-proxy
- --https-proxy
- --proxy-domains
- --proxy-exclude-domains
- --proxy-hostnames
- --proxy-exclude-hostnames
- --retry-dns-error: Wget considers DNS errors non-recoverable.
- --session-timeout: Abort downloading infinite MP3 streams.
- --no-skip-getaddrinfo
- --no-robots: Wpull is designed for archiving.
- --http-compression (gzip, deflate, & raw deflate)
- --html-parser: HTML parsing libraries have many trade-offs. Pick any two: small, fast, reliable.
- --link-extractors
- --escaped-fragment: Try to force HTML rendering instead of JavaScript.
- --strip-session-id
- --no-strong-crypto
- --no-use-internal-ca-certs
- --warc-append
- --warc-move: Move WARC files out of the way for resuming a crashed crawl.
- --page-requisites-level: Prevents infinite downloading of misconfigured server resources, such as HTML served where an image is expected.
- --sitemaps: Discover more URLs.
- --hostnames: Wget simply matches the endings when using --domains instead of matching each part of the hostname.
- --exclude-hostnames
- --span-hosts-allow: Allow fetching resources such as images hosted on another domain.
- --no-strong-redirects
- --proxy-server
- --proxy-server-address
- --proxy-server-port
- --phantomjs
- --phantomjs-exe
- --phantomjs-max-time
- --phantomjs-scroll
- --phantomjs-wait
- --no-phantomjs-snapshot
- --no-phantomjs-smart-scroll
- --youtube-dl
- --youtube-dl-exe
Help¶
Frequently Asked Questions¶
What does it mean by “Wget-compatible”?¶
It means that Wpull behaves similarly to Wget, but the internal machinery that powers Wpull is completely different from Wget.
What advantages does Wpull offer over Wget?¶
The motivation for the development of Wpull is to find a replacement for Wget that does not store URLs in memory and is scriptable.
Wpull supports an on-disk database, so memory requirements remain roughly constant. Wget stores URLs only in memory, so it will eventually run out of memory if you crawl millions of URLs at once.
Another motivation is to provide hooks that accept/reject URLs during the crawl.
What advantages does Wget offer over Wpull?¶
Wget is much more mature and stable. With many developers working on Wget, bug fixes and features arrive faster.
Wget is also written in C, which can handle text much faster. Wpull is written in Python, which was not designed for blazing-fast processing of data. This means that Wpull can be slow when processing large documents.
How can I change things while it is running? / Is there a GUI or web interface to make things easier?¶
Wpull does not offer a user-friendly interface to make changes as it runs at this time. However, please check out https://github.com/ludios/grab-site, which is a web interface built on top of Wpull.
Wpull is giving an error or not performing correctly.¶
Check that you have the options correct. In most cases, it is a misunderstanding of Wget options.
Otherwise if Wpull is not doing what you want, please visit the issue tracker and see if your issue is there. If not, please inform the developers by creating a new issue.
When you open a new issue, GitHub provides a link to the guidelines document. Please read it to learn how to file a good bug report.
How can I help the development of Wpull? What are the development goals?¶
Please visit the GitHub repository at https://github.com/chfoo/wpull. From there, you can take a look at:
- The Contributing file for specific instructions on how to help
- The issue tracker for current bugs and features
- The Wiki for the roadmap of the project such as goals and statuses
- And the code, of course
How can I chat or ask a question?¶
For chatting and quick questions, please visit the “unofficial” IRC channel: #archiveteam-bs on EFNet.
Alternatively if the discussion is lengthy, please use the issue tracker as described above. As a courtesy, if your question is answered on the issue tracker, please close the issue to mark your question as solved.
We highly prefer that you use IRC or the issue tracker. But email is also available: chris.foo@gmail.com
Plugin Scripting Hooks¶
Wpull’s scripting support is modelled after alard’s Wget with Lua hooks.
Scripts are installed using the YAPSY plugin architecture. To create your plugin script, subclass wpull.application.plugin.WpullPlugin and load it with the --plugin-script option.
The plugin interface provides two types of callbacks: hooks and events.
Hook¶
Hooks change the behavior of the program. When a callback is registered to a hook, it is required to provide a return value, typically one of wpull.application.hook.Actions. Only one callback may be registered to a hook.
To register your callback, decorate it with wpull.application.plugin.hook().
Event¶
Events are points in the program where registered listeners are notified as the program runs.
To register your callback, decorate it with wpull.application.plugin.event().
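The difference between hooks and events can be sketched with a toy dispatcher (a simplified stand-in, not Wpull’s actual classes): a hook holds exactly one callback whose return value steers the program, while an event fans out to any number of listeners whose return values are ignored:

```python
class HookAlreadyConnectedError(ValueError):
    """A callback is already connected to the hook."""


class ToyDispatcher:
    def __init__(self):
        self._hooks = {}    # name -> single callback (return value is used)
        self._events = {}   # name -> list of listeners (return values ignored)

    def connect_hook(self, name, callback):
        # Hooks accept only one callback each.
        if name in self._hooks:
            raise HookAlreadyConnectedError(name)
        self._hooks[name] = callback

    def call_hook(self, name, *args):
        # The hook's return value steers the caller's behavior.
        return self._hooks[name](*args)

    def connect_event(self, name, listener):
        # Events accept any number of listeners.
        self._events.setdefault(name, []).append(listener)

    def notify_event(self, name, *args):
        for listener in self._events.get(name, []):
            listener(*args)
```

Wpull’s real dispatcher additionally raises HookDisconnected when no callback is registered to a hook that is called.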
Interfaces¶
The global hook and event constants are located at wpull.application.plugin.PluginFunctions.
- PluginFunctions.accept_url (hook). Interface: FetchRule.plugin_accept_url
- PluginFunctions.dequeued_url (event). Interface: URLTableHookWrapper.dequeued_url
- PluginFunctions.exit_status (hook). Interface: AppStopTask.plugin_exit_status
- PluginFunctions.finishing_statistics (event). Interface: StatsStopTask.plugin_finishing_statistics
- PluginFunctions.get_urls (event). Interface: ProcessingRule.plugin_get_urls
- PluginFunctions.handle_error (hook). Interface: ResultRule.plugin_handle_error
- PluginFunctions.handle_pre_response (hook). Interface: ResultRule.plugin_handle_pre_response
- PluginFunctions.handle_response (hook). Interface: ResultRule.plugin_handle_response
- PluginFunctions.queued_url (event). Interface: URLTableHookWrapper.queued_url
- PluginFunctions.resolve_dns (hook). Interface: Resolver.resolve_dns
- PluginFunctions.resolve_dns_result (event). Interface: Resolver.resolve_dns_result
- PluginFunctions.wait_time (hook). Interface: ResultRule.plugin_wait_time
Example¶
Here is an example Python script. It:
- Prints hello on start up
- Refuses to download anything with the word “dog” in the URL
- Scrapes URLs on a hypothetical homepage
- Stops the program execution when the server returns HTTP 429
import re

from wpull.application.hook import Actions
from wpull.application.plugin import WpullPlugin, PluginFunctions, hook, event
from wpull.pipeline.session import ItemSession


class MyExamplePlugin(WpullPlugin):
    def activate(self):
        super().activate()
        print('Hello world!')

    def deactivate(self):
        super().deactivate()
        print('Goodbye world!')

    @hook(PluginFunctions.accept_url)
    def my_accept_func(self, item_session: ItemSession, verdict: bool, reasons: dict) -> bool:
        # Refuse to download anything with the word "dog" in the URL.
        return 'dog' not in item_session.request.url

    @event(PluginFunctions.get_urls)
    def my_get_urls(self, item_session: ItemSession):
        # Scrape profile URLs from the homepage only.
        if item_session.request.url_info.path != '/':
            return

        matches = re.finditer(
            r'<div id="profile-(\w+)"', item_session.response.body.content
        )

        for match in matches:
            url = 'http://example.com/profile.php?username={}'.format(
                match.group(1)
            )
            item_session.add_child_url(url)

    @hook(PluginFunctions.handle_response)
    def my_handle_response(self, item_session: ItemSession):
        # Stop the program when the server returns HTTP 429.
        if item_session.response.response_code == 429:
            return Actions.STOP
API¶
Wpull was designed as a command line program and most users do not need to read this section. However, you may be using the scripting hook interface or you may want to reuse a component.
Since Wpull is generally not a library, API backwards compatibility is provided on a best-effort basis; there is no guarantee on whether public or private functions will remain the same. This rule does not include the scripting hook interface which is designed for backwards compatibility.
Here lists all documented classes and functions. Not all members are documented yet. Some members, such as the backported modules, are not documented here.
If the documentation is not sufficient, please take a look at the source code. Suggestions and improvements are welcomed.
Note
The API is not thread-safe. It is intended to be run asynchronously with Asyncio.
Many functions are also decorated with the asyncio.coroutine() decorator. For more information, see https://docs.python.org/3/library/asyncio.html.
wpull Package¶
application Module¶
application.app Module¶
Application main interface.
- class wpull.application.app.Application(pipeline_series: wpull.pipeline.pipeline.PipelineSeries)
  Bases: wpull.application.hook.HookableMixin
  Default non-interactive application user interface. This class manages process signals and displaying warnings.
  - ERROR_CODE_MAP = OrderedDict([(AuthenticationError, 6), (ServerError, 8), (ProtocolError, 7), (SSLVerificationError, 5), (DNSNotFound, 4), (ConnectionRefused, 4), (NetworkError, 4), (OSError, 3)]): mapping of error types to exit status.
  - EXPECTED_EXCEPTIONS = (ServerError, ProtocolError, SSLVerificationError, DNSNotFound, ConnectionRefused, NetworkError, OSError, HookStop, StopIteration, SystemExit, KeyboardInterrupt): exception classes that are not crashes.
  - exit_code
application.builder Module¶
Application support.
- class wpull.application.builder.Builder(args, unit_test=False)
  Bases: object
  Application builder.
  Parameters: args: options from argparse.ArgumentParser
  - factory: returns the Factory.
    Returns: a factory.Factory instance. Return type: Factory
application.factory
Module¶
Instance creation and management.
-
class
wpull.application.factory.
Factory
(class_map=None)[source]¶ Bases:
collections.abc.Mapping
,object
Allows selection of classes and keeps track of instances.
This class behaves like a mapping. Keys are names of classes and values are instances.
-
class_map
¶ A mapping of names to class types.
-
instance_map
¶ A mapping of names to instances.
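The mapping behavior can be illustrated with a small self-contained sketch (an illustration of the idea, not the actual Wpull implementation; the name `MiniFactory` and the `new()` signature here are assumptions):

```python
from collections.abc import Mapping

class MiniFactory(Mapping):
    """Keeps a mapping of names to classes and a mapping of names to instances."""

    def __init__(self, class_map=None):
        self.class_map = dict(class_map or {})
        self.instance_map = {}

    def new(self, name, *args, **kwargs):
        # Instantiate the registered class and remember the instance.
        instance = self.class_map[name](*args, **kwargs)
        self.instance_map[name] = instance
        return instance

    def __getitem__(self, name):
        return self.instance_map[name]

    def __iter__(self):
        return iter(self.instance_map)

    def __len__(self):
        return len(self.instance_map)

factory = MiniFactory({'list': list})
factory.new('list', [1, 2])
assert factory['list'] == [1, 2]
```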
-
application.hook
Module¶
Python and Lua scripting support.
See Plugin Scripting Hooks for an introduction.
-
class
wpull.application.hook.
Actions
[source]¶ Bases:
enum.Enum
Actions for handling responses and errors.
-
NORMAL
¶ normal
Use Wpull’s original behavior.
-
RETRY
¶ retry
Retry this item (as if an error has occurred).
-
FINISH
¶ finish
Consider this item as done; don’t do any further processing on it.
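A hook callback typically returns one of these values to tell the engine what to do next. A hypothetical, self-contained sketch of the idea (the real hook names and signatures may differ):

```python
import enum

class Actions(enum.Enum):
    NORMAL = 'normal'   # use Wpull's original behavior
    RETRY = 'retry'     # retry the item as if an error occurred
    FINISH = 'finish'   # consider the item done; skip further processing

def handle_response(status_code):
    # Hypothetical callback: give up immediately on 404s, retry on 5xx.
    if status_code == 404:
        return Actions.FINISH
    if status_code >= 500:
        return Actions.RETRY
    return Actions.NORMAL

assert handle_response(404) is Actions.FINISH
```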
-
-
exception
wpull.application.hook.
HookAlreadyConnectedError
[source]¶ Bases:
ValueError
A callback is already connected to the hook.
-
exception
wpull.application.hook.
HookDisconnected
[source]¶ Bases:
RuntimeError
No callback is connected.
-
class
wpull.application.hook.
HookDispatcher
(event_dispatcher_transclusion: typing.Union=None)[source]¶ Bases:
collections.abc.Mapping
Dynamic callback hook system.
application.main
Module¶
application.options
Module¶
Program options.
-
class
wpull.application.options.
AppArgumentParser
(*args, real_exit=True, **kwargs)[source]¶ Bases:
argparse.ArgumentParser
An Argument Parser that builds up the application options.
-
class
wpull.application.options.
AppHelpFormatter
(prog, indent_increment=2, max_help_position=24, width=None)[source]¶ Bases:
argparse.HelpFormatter
-
class
wpull.application.options.
CommaChoiceListArgs
[source]¶ Bases:
frozenset
Specialized frozenset.
This class overrides the
__contains__
function to allow the use of thein
operator for ArgumentParser’schoices
checking for comma separated lists. The function behaves differently only when the objects compared are CommaChoiceListArgs.
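The trick can be reproduced in a few lines: argparse validates a value with `value in choices`, so overriding `__contains__` to accept any comma-separated combination of the allowed names makes a single `choices` object validate whole lists. A self-contained sketch (not the exact Wpull code):

```python
import argparse

class CommaChoiceListArgs(frozenset):
    def __contains__(self, item):
        # Accept 'a,b' if every comma-separated token is an allowed choice.
        if isinstance(item, str):
            return all(
                frozenset.__contains__(self, token)
                for token in item.split(',')
            )
        return frozenset.__contains__(self, item)

parser = argparse.ArgumentParser()
parser.add_argument(
    '--report', choices=CommaChoiceListArgs(['speed', 'size', 'errors']))
args = parser.parse_args(['--report', 'speed,errors'])
assert args.report == 'speed,errors'
```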
application.plugin
Module¶
-
wpull.application.plugin.
PluginClientFunctionInfo
¶ alias of
_PluginClientFunctionInfo
application.plugins
Module¶
application.plugins.arg_warning.plugin
Module¶
application.plugins.debug_console.plugin
Module¶
application.plugins.download_progress.plugin
Module¶
application.plugins.server_response.plugin
Module¶
application.tasks
Module¶
application.tasks.conversion
Module¶
-
class
wpull.application.tasks.conversion.
QueuedFileSession
(app_session: wpull.pipeline.app.AppSession, file_id: int, url_record: wpull.pipeline.item.URLRecord)[source]¶ Bases:
object
application.tasks.database
Module¶
application.tasks.download
Module¶
application.tasks.log
Module¶
application.tasks.plugin
Module¶
-
class
wpull.application.tasks.plugin.
PluginLocator
(directories, paths)[source]¶ Bases:
yapsy.IPluginLocator.IPluginLocator
application.tasks.resmon
Module¶
application.tasks.rule
Module¶
application.tasks.shutdown
Module¶
-
class
wpull.application.tasks.shutdown.
AppStopTask
[source]¶ Bases:
wpull.pipeline.pipeline.ItemTask
,wpull.application.hook.HookableMixin
-
static
plugin_exit_status
(app_session: wpull.pipeline.app.AppSession, exit_code: int) → int[source]¶ Return the program exit status code.
Exit codes are values from
errors.ExitStatus
.Parameters: exit_code – The exit code Wpull wants to return. Returns: The exit code that Wpull will return. Return type: int
-
application.tasks.sslcontext
Module¶
application.tasks.stats
Module¶
-
class
wpull.application.tasks.stats.
StatsStopTask
[source]¶ Bases:
wpull.pipeline.pipeline.ItemTask
,wpull.application.hook.HookableMixin
-
static
plugin_finishing_statistics
(app_session: wpull.pipeline.app.AppSession, statistics: wpull.stats.Statistics)[source]¶ Callback containing final statistics.
Parameters: - start_time (float) – timestamp when the engine started
- end_time (float) – timestamp when the engine stopped
- num_urls (int) – number of URLs downloaded
- bytes_downloaded (int) – size of files downloaded in bytes
-
application.tasks.warc
Module¶
body
Module¶
Request and response payload.
-
class
wpull.body.
Body
(file=None, directory=None, hint='lone_body')[source]¶ Bases:
object
Represents the document/payload of a request or response.
This class is a wrapper around a file object. Methods are forwarded to the underlying file object.
-
file
¶ file
The file object.
Parameters: - file (file, optional) – Use the given file as the file object.
- directory (str) – If file is not given, use directory for a new temporary file.
- hint (str) – If file is not given, use hint as a filename infix.
-
cache
Module¶
Caching.
-
class
wpull.cache.
CacheItem
(key, value, time_to_live=None, access_time=None)[source]¶ Bases:
object
Info about an item in the cache.
Parameters: - key – The key
- value – The value
- time_to_live – The time in seconds of how long to keep the item
- access_time – The timestamp of the last use of the item
-
expire_time
¶ When the item expires.
-
class
wpull.cache.
FIFOCache
(max_items=None, time_to_live=None)[source]¶ Bases:
wpull.cache.BaseCache
First in first out object cache.
Parameters: - max_items (int) – The maximum number of items to keep.
- time_to_live (float) – Discard items after time_to_live seconds.
Reusing a key to update a value will not affect the expire time of the item.
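A minimal sketch of the same idea: a FIFO mapping with a size cap and a time-to-live, where updating an existing key keeps its original expire time (an illustration, not the Wpull implementation):

```python
import time
from collections import OrderedDict

class MiniFIFOCache:
    def __init__(self, max_items=None, time_to_live=None):
        self.max_items = max_items
        self.time_to_live = time_to_live
        self._data = OrderedDict()  # key -> (value, expire_time)

    def __setitem__(self, key, value):
        if key in self._data:
            # Reusing a key does not reset the expire time.
            _, expire = self._data[key]
        elif self.time_to_live is not None:
            expire = time.monotonic() + self.time_to_live
        else:
            expire = None
        self._data[key] = (value, expire)
        while self.max_items is not None and len(self._data) > self.max_items:
            self._data.popitem(last=False)  # evict the oldest insertion first

    def get(self, key):
        if key not in self._data:
            return None
        value, expire = self._data[key]
        if expire is not None and time.monotonic() > expire:
            del self._data[key]
            return None
        return value

cache = MiniFIFOCache(max_items=2)
cache['a'] = 1
cache['b'] = 2
cache['c'] = 3  # evicts 'a'
assert cache.get('a') is None and cache.get('c') == 3
```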
-
class
wpull.cache.
LRUCache
(max_items=None, time_to_live=None)[source]¶ Bases:
wpull.cache.FIFOCache
Least recently used object cache.
Parameters: - max_items – The maximum number of items to keep
- time_to_live – The time in seconds of how long to keep the item
-
wpull.cache.
total_ordering
(obj)¶
collections
Module¶
Data structures.
-
class
wpull.collections.
FrozenDict
(orig_dict)[source]¶ Bases:
collections.abc.Mapping
,collections.abc.Hashable
Immutable mapping wrapper.
-
hash_cache
¶
-
orig_dict
¶
-
-
class
wpull.collections.
LinkedList
[source]¶ Bases:
object
Doubly linked list.
-
map
¶ dict
A mapping of values to nodes.
-
head
¶ -
The first node.
-
tail
¶ -
The last node.
-
-
class
wpull.collections.
LinkedListNode
(value, head=None, tail=None)[source]¶ Bases:
object
A node in a
LinkedList
.-
value
¶ Any value.
-
head
¶ LinkedListNode
The node in front.
-
tail
¶ LinkedListNode
The node in back.
converter
Module¶
Document content post-processing.
-
class
wpull.converter.
BaseDocumentConverter
[source]¶ Bases:
object
Base class for classes that convert links within a document.
-
class
wpull.converter.
BatchDocumentConverter
(html_parser, element_walker, url_table, backup=False)[source]¶ Bases:
object
Convert all documents in URL table.
Parameters: - url_table – An instance of
database.URLTable
- backup (bool) – Whether backup files are created.
- url_table – An instance of
-
class
wpull.converter.
CSSConverter
(url_table)[source]¶ Bases:
wpull.scraper.css.CSSScraper
,wpull.converter.BaseDocumentConverter
CSS converter.
-
class
wpull.converter.
HTMLConverter
(html_parser, element_walker, url_table)[source]¶ Bases:
wpull.scraper.html.HTMLScraper
,wpull.converter.BaseDocumentConverter
HTML converter.
cookie
Module¶
HTTP Cookies.
Bases:
http.cookiejar.FileCookieJar
MozillaCookieJar that is compatible with Wget/Curl.
It ignores file header checks and supports session cookies.
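For comparison, the standard library's MozillaCookieJar enforces the Netscape file header and discards session cookies unless told otherwise. A usage sketch with the stdlib class (the filename and cookie values here are made up):

```python
import http.cookiejar
import os
import tempfile

# Write a minimal Netscape-format cookie file. The stdlib jar insists on
# this magic header; Wpull's variant is lenient about it.
content = (
    '# Netscape HTTP Cookie File\n'
    'example.com\tFALSE\t/\tFALSE\t2147483647\tname\tvalue\n'
)
path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
with open(path, 'w') as f:
    f.write(content)

jar = http.cookiejar.MozillaCookieJar()
# ignore_discard=True also keeps session cookies when loading/saving.
jar.load(path, ignore_discard=True, ignore_expires=True)
assert len(jar) == 1
```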
Bases:
http.cookiejar.DefaultCookiePolicy
Cookie policy that limits the content and length of the cookie.
Parameters: cookie_jar – The CookieJar instance. This policy class is not designed to be shared between CookieJar instances.
Return approximate length of all cookie key-values for a domain.
Return the number of cookies for the given domain.
cookiewrapper
Module¶
Wrappers that wrap instances to Python standard library.
Bases:
object
Wraps a CookieJar.
Parameters: - cookie_jar – An instance of
http.cookiejar.CookieJar
. - save_filename (str, optional) – A filename to save the cookies.
- keep_session_cookies (bool) – If True, session cookies are kept when saving to file.
Wrapped
add_cookie_header
.Parameters: - request – An instance of
http.request.Request
- referrer_host (str) – A hostname or IP address of the referrer URL.
- request – An instance of
Save the cookie jar if needed.
Return the wrapped Cookie Jar.
Wrapped
extract_cookies
.Parameters: - response – An instance of
http.request.Response
. - request – An instance of
http.request.Request
- referrer_host (str) – A hostname or IP address of the referrer URL.
- response – An instance of
- cookie_jar – An instance of
Bases:
object
Wraps a HTTP Response.
Parameters: response – An instance of http.request.Response
Return the header fields as a Message.
Returns: An instance of email.message.Message
. If Python 2, returns an instance ofmimetools.Message
.Return type: Message
Convert a HTTP request.
Parameters: - request – An instance of
http.request.Request
- referrer_host (str) – The referring hostname or IP address.
Returns: An instance of
urllib.request.Request
Return type: - request – An instance of
database
Module¶
Storage for tracking URLs.
database.base
Module¶
Base table class.
-
wpull.database.base.
AddURLInfo
¶ alias of
_AddURLInfo
-
class
wpull.database.base.
BaseURLTable
[source]¶ Bases:
object
URL table.
-
add_many
(new_urls: typing.Iterator) → typing.Iterator[source]¶ Add the URLs to the table.
Parameters: new_urls – URLs to be added. Returns: The URLs added. Useful for tracking duplicates.
-
add_one
(url: str, url_properties: typing.Union=None, url_data: typing.Union=None)[source]¶ Add a single URL to the table.
Parameters: - url – The URL to be added
- url_properties – Additional values to be saved
- url_data – Additional data to be saved
-
add_visits
(visits)[source]¶ Add visited URLs from CDX file.
Parameters: visits (iterable) – An iterable of items. Each item is a tuple containing a URL, the WARC ID, and the payload digest.
-
check_in
(url: str, new_status: wpull.pipeline.item.Status, increment_try_count: bool=True, url_result: typing.Union=None)[source]¶ Update record for processed URL.
Parameters: - url – The URL.
- new_status – Update the item status to new_status.
- increment_try_count – Whether to increment the try counter for the URL.
- url_result – Additional values.
-
check_out
(filter_status: wpull.pipeline.item.Status, filter_level: typing.Union=None) → wpull.pipeline.item.URLRecord[source]¶ Find a URL, mark it in progress, and return it.
Parameters: - filter_status – Gets first item with given status.
- filter_level – Gets item with filter_level or lower.
Raises:
-
get_one
(url: str) → wpull.pipeline.item.URLRecord[source]¶ Return a URLRecord for the URL.
Raises: NotFound
-
-
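The check-out/check-in cycle described above is the heart of the crawl queue: a worker takes the first item with a given status, processes it, and writes the result back. A self-contained, in-memory sketch of that lifecycle (the real tables are SQLAlchemy-backed; `MiniURLTable` and its internals are illustrative only):

```python
import enum

class Status(enum.Enum):
    todo = 'todo'
    in_progress = 'in_progress'
    done = 'done'
    error = 'error'

class MiniURLTable:
    def __init__(self):
        self._records = {}  # url -> {'status': ..., 'try_count': ...}

    def add_one(self, url):
        self._records.setdefault(url, {'status': Status.todo, 'try_count': 0})

    def check_out(self, filter_status):
        # Find a URL with the wanted status and mark it in progress.
        for url, record in self._records.items():
            if record['status'] is filter_status:
                record['status'] = Status.in_progress
                return url
        raise KeyError('not found')

    def check_in(self, url, new_status, increment_try_count=True):
        record = self._records[url]
        record['status'] = new_status
        if increment_try_count:
            record['try_count'] += 1

table = MiniURLTable()
table.add_one('http://example.com/')
url = table.check_out(Status.todo)
table.check_in(url, Status.done)
assert table._records[url]['status'] is Status.done
```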
exception
wpull.database.base.
NotFound
[source]¶ Bases:
wpull.database.base.DatabaseError
Item not found in the table.
database.sqlmodel
Module¶
Database SQLAlchemy model.
-
wpull.database.sqlmodel.
DBBase
¶ alias of
Base
-
class
wpull.database.sqlmodel.
QueuedURL
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
-
filename
¶ Local filename of the item.
-
id
¶
-
inline_level
¶ Depth of the page requisite object. 0 is the object, 1 is the object’s dependency, etc.
-
level
¶ Recursive depth of the item. 0 is root, 1 is child of root, etc.
-
link_type
¶ Expected content type of extracted link.
-
parent_url
¶ A descriptor that presents a read/write view of an object attribute.
-
parent_url_string
¶
-
parent_url_string_id
¶ Optional referral URL
-
post_data
¶ Additional percent-encoded data for POST.
-
priority
¶ Priority of item.
-
root_url
¶ A descriptor that presents a read/write view of an object attribute.
-
root_url_string
¶
-
root_url_string_id
¶ Optional root URL
-
status
¶ Status of the completion of the item.
-
status_code
¶ HTTP status code or FTP reply code.
-
try_count
¶ Number of attempts made in order to process the item.
-
url
¶ A descriptor that presents a read/write view of an object attribute.
-
url_string
¶
-
url_string_id
¶ Target URL to fetch
-
-
class
wpull.database.sqlmodel.
URLString
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
Table containing the URL strings.
The
URL
references this table.-
id
¶
-
url
¶
-
-
class
wpull.database.sqlmodel.
WARCVisit
(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base
Standalone table for
--cdx-dedup
feature.-
payload_digest
¶
-
url
¶
-
warc_id
¶
-
database.sqltable
Module¶
SQLAlchemy table implementations.
-
class
wpull.database.sqltable.
SQLiteURLTable
(path=':memory:')[source]¶ Bases:
wpull.database.sqltable.BaseSQLURLTable
URL table with SQLite storage.
Parameters: path – A SQLite filename
-
class
wpull.database.sqltable.
GenericSQLURLTable
(url)[source]¶ Bases:
wpull.database.sqltable.BaseSQLURLTable
URL table using SQLAlchemy without any customizations.
Parameters: url – A SQLAlchemy database URL.
-
wpull.database.sqltable.
URLTable
¶ The default URL table implementation.
alias of
SQLiteURLTable
database.wrap
Module¶
URL table wrappers.
-
class
wpull.database.wrap.
URLTableHookWrapper
(url_table)[source]¶ Bases:
wpull.database.base.BaseURLTable
,wpull.application.hook.HookableMixin
URL table wrapper with scripting hooks.
Parameters: url_table – URL table. -
url_table
¶ URL table.
-
static
dequeued_url
(url_info: wpull.url.URLInfo, record_info: wpull.pipeline.item.URLRecord)[source]¶ Callback fired after a URL was retrieved from the queue.
-
debug
Module¶
Debugging utilities.
-
class
wpull.debug.
DebugConsoleHandler
(application, request, **kwargs)[source]¶ Bases:
tornado.web.RequestHandler
-
TEMPLATE
= '<html>\n <style>\n #commandbox {{\n width: 100%;\n }}\n </style>\n <body>\n <p>Welcome to DEBUG CONSOLE!</p>\n <p><tt>Builder()</tt> instance at <tt>wpull_builder</tt>.</p>\n <form method="post">\n <input id="commandbox" name="command" value="{command}">\n <input type="submit" value="Execute">\n </form>\n <pre>{output}</pre>\n </body>\n </html>\n '¶
-
decompression
Module¶
Streaming decompressors.
-
class
wpull.decompression.
DeflateDecompressor
[source]¶ Bases:
wpull.decompression.SimpleGzipDecompressor
zlib decompressor with raw deflate detection.
This class doesn’t do anything special. It only tries regular zlib and then tries raw deflate on the first decompress.
-
class
wpull.decompression.
GzipDecompressor
[source]¶ Bases:
wpull.decompression.SimpleGzipDecompressor
gzip decompressor with gzip header detection.
This class checks if the stream starts with the 2 byte gzip magic number. If it is not present, it returns the bytes unchanged.
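The detection described above can be sketched with the standard zlib module: peek at the first two bytes and only decompress when the gzip magic number (0x1f 0x8b) is present (an illustration of the technique, not Wpull's exact code):

```python
import gzip
import zlib

GZIP_MAGIC = b'\x1f\x8b'

def maybe_gunzip(data):
    # Pass non-gzip streams through unchanged, as GzipDecompressor does.
    if not data.startswith(GZIP_MAGIC):
        return data
    # wbits=16 + MAX_WBITS tells zlib to expect a gzip header and checksum.
    return zlib.decompress(data, 16 + zlib.MAX_WBITS)

assert maybe_gunzip(b'plain bytes') == b'plain bytes'
assert maybe_gunzip(gzip.compress(b'hello')) == b'hello'
```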
-
class
wpull.decompression.
SimpleGzipDecompressor
[source]¶ Bases:
object
Streaming gzip decompressor.
The interface is like that of zlib.decompressobj (without some of the optional arguments), but it understands gzip headers and checksums.
document
Module¶
Document handling.
document.base
Module¶
Document bases.
-
class
wpull.document.base.
BaseDocumentDetector
[source]¶ Bases:
object
Base class for classes that detect document types.
-
classmethod
is_file
(file)[source]¶ Return whether the reader is likely able to read the file.
Parameters: file – A file object containing the document. Returns: bool
-
classmethod
is_request
(request)[source]¶ Return whether the request is likely supported.
Parameters: request ( http.request.Request
) – An HTTP request.Returns: bool
-
classmethod
is_response
(response)[source]¶ Return whether the response is likely able to be read.
Parameters: response ( http.request.Response
) – An HTTP response.Returns: bool
-
classmethod
is_supported
(file=None, request=None, response=None, url_info=None)[source]¶ Given the hints, return whether the document is supported.
Parameters: - file – A file object containing the document.
- request (
http.request.Request
) – An HTTP request. - response (
http.request.Response
) – An HTTP response. - url_info (
url.URLInfo
) – A URLInfo.
Returns: If True, the reader should be able to read it.
Return type: bool
-
classmethod
is_url
(url_info)[source]¶ Return whether the URL is likely to be supported.
Parameters: url_info ( url.URLInfo
) – A URLInfo.Returns: bool
-
-
class
wpull.document.base.
BaseExtractiveReader
[source]¶ Bases:
object
Base class for document readers that can only extract links.
-
class
wpull.document.base.
BaseHTMLReader
[source]¶ Bases:
object
Base class for document readers for handling SGML-like documents.
-
iter_elements
(file, encoding=None)[source]¶ Return an iterator of elements found in the document.
Parameters: - file – A file object containing the document.
- encoding (str) – The encoding of the document.
Returns: Each item is an element from
document.htmlparse.element
Return type: iterator
-
-
class
wpull.document.base.
BaseTextStreamReader
[source]¶ Bases:
object
Base class for document readers that filters link and non-link text.
-
iter_links
(file, encoding=None, context=False)[source]¶ Return the links.
This function is a convenience function for calling
iter_text()
and returning only the links.
-
iter_text
(file, encoding=None)[source]¶ Return the file text and links.
Parameters: - file – A file object containing the document.
- encoding (str) – The encoding of the document.
Returns: Each item is a tuple:
- str: The text
- bool (or truthy value): Whether the text is likely a link. A truthy value may be provided containing additional context of the link.
Return type: iterator
The links returned are raw text and will require further processing.
-
-
wpull.document.base.
VeryFalse
= <wpull.document.base.VeryFalseType object>¶ Document is not definitely supported.
document.css
Module¶
Stylesheet reader.
-
class
wpull.document.css.
CSSReader
[source]¶ Bases:
wpull.document.base.BaseDocumentDetector
,wpull.document.base.BaseTextStreamReader
Cascading Stylesheet Document Reader.
-
BUFFER_SIZE
= 1048576¶
-
IMPORT_URL_PATTERN
= '@import\\s*(?:url\\()?[\'"]?([^\\s\'")]{1,500}).*?;'¶
-
STREAM_REWIND
= 4096¶
-
URL_PATTERN
= 'url\\(\\s*([\'"]?)(.{1,500}?)(?:\\1)\\s*\\)'¶
-
URL_REGEX
= re.compile('url\\(\\s*([\'"]?)(.{1,500}?)(?:\\1)\\s*\\)|@import\\s*(?:url\\()?[\'"]?([^\\s\'")]{1,500}).*?;')¶
-
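The two patterns above can be exercised directly. A short sketch that extracts URLs from a stylesheet fragment using the combined regular expression shown above:

```python
import re

# Patterns as documented on CSSReader.
URL_PATTERN = 'url\\(\\s*([\'"]?)(.{1,500}?)(?:\\1)\\s*\\)'
IMPORT_URL_PATTERN = '@import\\s*(?:url\\()?[\'"]?([^\\s\'")]{1,500}).*?;'
URL_REGEX = re.compile(URL_PATTERN + '|' + IMPORT_URL_PATTERN)

css = '''
@import "reset.css";
body { background: url("bg.png"); }
'''

urls = []
for match in URL_REGEX.finditer(css):
    # Group 2 is a url(...) value; group 3 is an @import target.
    urls.append(match.group(2) or match.group(3))

assert urls == ['reset.css', 'bg.png']
```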
document.html
Module¶
HTML document readers.
-
wpull.document.html.
COMMENT
= <object object>¶ Comment element
-
class
wpull.document.html.
HTMLLightParserTarget
(callback, text_elements=frozenset({'style', 'script', 'link', 'url', 'icon'}))[source]¶ Bases:
object
An HTML parser target for partial elements.
Parameters: - callback – A callback function. The function should accept three arguments: tag (str), the tag name of the element; attrib (dict), the attributes of the element; and text (str, None), the text of the element.
- text_elements – A frozenset of element tag names whose text should be tracked.
-
class
wpull.document.html.
HTMLParserTarget
(callback)[source]¶ Bases:
object
An HTML parser target.
Parameters: callback – A callback function. The function should accept five arguments: tag (str), the tag name of the element; attrib (dict), the attributes of the element; text (str, None), the text of the element; tail (str, None), the text after the element; and end (bool), whether the tag is an end tag.
-
class
wpull.document.html.
HTMLReadElement
(tag, attrib, text, tail, end)[source]¶ Bases:
object
Results from
HTMLReader.read_links()
.-
tag
¶ str
The element tag name.
-
attrib
¶ dict
The element attributes.
-
text
¶ str, None
The element text.
-
tail
¶ str, None
The text after the element.
-
end
¶ bool
Whether the tag is an end tag.
-
class
wpull.document.html.
HTMLReader
(html_parser)[source]¶ Bases:
wpull.document.base.BaseDocumentDetector
,wpull.document.base.BaseHTMLReader
HTML document reader.
Parameters: html_parser ( document.htmlparse.BaseParser
) – An HTML parser.
document.htmlparse
Module¶
HTML parsing.
document.htmlparse.base
Module¶
document.htmlparse.element
Module¶
HTML tree things.
-
wpull.document.htmlparse.element.
Comment
¶ A comment.
-
wpull.document.htmlparse.element.
text
¶ str
The comment text.
alias of
CommentType
-
-
wpull.document.htmlparse.element.
Doctype
¶ A Doctype.
-
wpull.document.htmlparse.element.
text
str
The Doctype text.
alias of
DoctypeType
-
-
wpull.document.htmlparse.element.
Element
¶ An HTML element.
- Attributes
- tag (str): The tag name of the element. attrib (dict): The attributes of the element. text (str, None): The text of the element. tail (str, None): The text after the element. end (bool): Whether the tag is an end tag.
alias of
ElementType
document.htmlparse.html5lib_
Module¶
Parsing using html5lib python.
document.htmlparse.lxml_
Module¶
Parsing using lxml and libxml2.
-
class
wpull.document.htmlparse.lxml_.
HTMLParser
[source]¶ Bases:
wpull.document.htmlparse.base.BaseParser
HTML document parser.
This reader uses lxml as the parser.
-
BUFFER_SIZE
= 131072¶
-
classmethod
detect_parser_type
(file, encoding=None)[source]¶ Get the suitable parser type for the document.
Returns: str
-
classmethod
parse_doctype
(file, encoding=None)[source]¶ Get the doctype from the document.
Returns: str, None
-
parse_lxml
(file, encoding=None, target_class=<class 'wpull.document.htmlparse.lxml_.HTMLParserTarget'>, parser_type='html')[source]¶ Return an iterator of elements found in the document.
Parameters: - file – A file object containing the document.
- encoding (str) – The encoding of the document.
- target_class – A class to be used for target parsing.
- parser_type (str) – The type of parser to use. Accepted values:
html
,xhtml
,xml
.
Returns: Each item is an element from
document.htmlparse.element
Return type: iterator
-
parser_error
¶
-
-
class
wpull.document.htmlparse.lxml_.
HTMLParserTarget
(callback)[source]¶ Bases:
object
An HTML parser target.
Parameters: callback – A callback function. The function should accept one argument from document.htmlparse.element
.
document.javascript
Module¶
-
class
wpull.document.javascript.
JavaScriptReader
[source]¶ Bases:
wpull.document.base.BaseDocumentDetector
,wpull.document.base.BaseTextStreamReader
JavaScript Document Reader.
-
BUFFER_SIZE
= 1048576¶
-
STREAM_REWIND
= 4096¶
-
URL_PATTERN
= '(\\\\{0,8}[\'"])(https?://[^\'"]{1,500}|[^\\s\'"]{1,500})(?:\\1)'¶
-
URL_REGEX
= re.compile('(\\\\{0,8}[\'"])(https?://[^\'"]{1,500}|[^\\s\'"]{1,500})(?:\\1)')¶
-
document.sitemap
Module¶
Sitemap.xml
-
class
wpull.document.sitemap.
SitemapReader
(html_parser)[source]¶ Bases:
wpull.document.base.BaseDocumentDetector
,wpull.document.base.BaseExtractiveReader
Sitemap XML reader.
-
MAX_ROBOTS_FILE_SIZE
= 4096¶
-
document.util
Module¶
Misc functions.
-
wpull.document.util.
detect_response_encoding
(response, is_html=False, peek=131072)[source]¶ Return the likely encoding of the response document.
Parameters: - response (Response) – An instance of
http.Response
. - is_html (bool) – See
util.detect_encoding()
. - peek (int) – The maximum number of bytes of the document to be analyzed.
Returns: The codec name.
Return type: str
,None
- response (Response) – An instance of
document.xml
Module¶
XML document.
driver
Module¶
Interprocess communicators.
driver.phantomjs
Module¶
-
class
wpull.driver.phantomjs.
PhantomJSDriver
(exe_path='phantomjs', extra_args=None, params=None)[source]¶ Bases:
wpull.driver.process.Process
PhantomJS processing.
Parameters: - exe_path (str) – Path of the PhantomJS executable.
- extra_args (list) – Additional arguments for PhantomJS. Most likely, you’ll want to pass proxy settings for capturing traffic.
- params (
PhantomJSDriverParams
) – Parameters for controlling the processing pipeline.
This class launches a PhantomJS process that scrolls the page and saves snapshots. It can only be used once per URL.
-
wpull.driver.phantomjs.
PhantomJSDriverParams
¶ PhantomJS Driver parameters
-
wpull.driver.phantomjs.
url
¶ str
URL of page to fetch.
-
wpull.driver.phantomjs.
snapshot_type
¶ list
List of filenames. Accepted extensions are html, pdf, png, gif.
-
wpull.driver.phantomjs.
wait_time
¶ float
Time between page scrolls.
-
wpull.driver.phantomjs.
num_scrolls
¶ int
Maximum number of scrolls.
-
wpull.driver.phantomjs.
smart_scroll
¶ bool
Whether to stop scrolling if number of requests & responses do not change.
-
wpull.driver.phantomjs.
snapshot
¶ bool
Whether to take snapshot files.
-
wpull.driver.phantomjs.
viewport_size
¶ tuple
Width and height of the page viewport.
-
wpull.driver.phantomjs.
paper_size
¶ tuple
Width and height of the paper size.
-
wpull.driver.phantomjs.
event_log_filename
¶ str
Path to save page events.
-
wpull.driver.phantomjs.
action_log_filename
¶ str
Path to save page action manipulation events.
-
wpull.driver.phantomjs.
custom_headers
¶ dict
Custom HTTP request headers.
-
wpull.driver.phantomjs.
page_settings
¶ dict
Page settings.
alias of
PhantomJSDriverParamsType
-
driver.process
Module¶
RPC processes.
errors
Module¶
Exceptions.
-
exception
wpull.errors.
AuthenticationError
[source]¶ Bases:
wpull.errors.ServerError
Username or password error.
-
exception
wpull.errors.
ConnectionRefused
[source]¶ Bases:
wpull.errors.NetworkError
Server was online, but nothing was being served.
-
exception
wpull.errors.
DNSNotFound
[source]¶ Bases:
wpull.errors.NetworkError
Server’s IP address could not be located.
-
wpull.errors.
ERROR_PRIORITIES
= (<class 'wpull.errors.ServerError'>, <class 'wpull.errors.ProtocolError'>, <class 'wpull.errors.SSLVerificationError'>, <class 'wpull.errors.AuthenticationError'>, <class 'wpull.errors.DNSNotFound'>, <class 'wpull.errors.ConnectionRefused'>, <class 'wpull.errors.NetworkError'>, <class 'OSError'>, <class 'OSError'>, <class 'ValueError'>)¶ List of error classes from least severe to most severe.
-
class
wpull.errors.
ExitStatus
[source]¶ Bases:
object
Program exit status codes.
-
generic_error
¶ 1
An unclassified serious or fatal error occurred.
-
parser_error
¶ 2
A local document or configuration file could not be parsed.
-
file_io_error
¶ 3
A problem with reading/writing a file occurred.
-
network_failure
¶ 4
A problem with the network occurred such as a DNS resolver error or a connection was refused.
-
ssl_verification_error
¶ 5
A server’s SSL/TLS certificate was invalid.
-
authentication_failure
¶ 6
A problem with a username or password.
-
protocol_error
¶ 7
A problem with communicating with a server occurred.
-
server_error
¶ 8
The server had problems fulfilling our requests.
-
-
exception
wpull.errors.
NetworkTimedOut
[source]¶ Bases:
wpull.errors.NetworkError
Connection read/write timed out.
-
wpull.errors.
SSLVerficationError
¶ alias of
SSLVerificationError
namevalue
Module¶
Key-value pairs.
-
class
wpull.namevalue.
NameValueRecord
(normalize_overrides=None, encoding='utf-8', wrap_width=None)[source]¶ Bases:
collections.abc.MutableMapping
An ordered mapping of name-value pairs.
Duplicated names are accepted.
-
wpull.namevalue.
guess_line_ending
(string)[source]¶ Return the most likely line delimiter from the string.
-
wpull.namevalue.
normalize_name
(name, overrides=None)[source]¶ Normalize the key name to title case.
For example,
normalize_name('content-id')
will becomeContent-Id
Parameters: - name (str) – The name to normalize.
- overrides (set, sequence) – A set or sequence containing keys that should be cased to themselves. For example, passing {'WARC-Type'} will normalize any key named “warc-type” to WARC-Type instead of the default Warc-Type.
Returns: str
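A self-contained sketch of the behavior described above (an illustration, not the exact Wpull source):

```python
def normalize_name(name, overrides=None):
    """Title-case a field name unless an override spells it differently."""
    if overrides:
        # An override wins when it matches case-insensitively.
        for override in overrides:
            if override.lower() == name.lower():
                return override
    return '-'.join(part.capitalize() for part in name.split('-'))

assert normalize_name('content-id') == 'Content-Id'
assert normalize_name('warc-type', overrides={'WARC-Type'}) == 'WARC-Type'
```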
network
Module¶
network.bandwidth
Module¶
Network bandwidth.
-
class
wpull.network.bandwidth.
BandwidthLimiter
(rate_limit)[source]¶ Bases:
wpull.network.bandwidth.BandwidthMeter
Bandwidth rate limit calculator.
-
class
wpull.network.bandwidth.
BandwidthMeter
(sample_size=20, sample_min_time=0.15, stall_time=5.0)[source]¶ Bases:
object
Calculates the speed of data transfer.
Parameters: - sample_size (int) – The number of samples for measuring the speed.
- sample_min_time (float) – The minimum duration between samples in seconds.
- stall_time (float) – The time in seconds to consider no traffic to be connection stalled.
-
bytes_transferred
¶ Return the number of bytes transferred
Returns: int
-
feed
(data_len, feed_time=None)[source]¶ Update the bandwidth meter.
Parameters: - data_len (int) – The number of bytes transferred since the last
call to
feed()
. - feed_time (float) – Current time.
- data_len (int) – The number of bytes transfered since the last
call to
-
num_samples
¶ Return the number of samples collected.
-
speed
()[source]¶ Return the current transfer speed.
Returns: The speed in bytes per second. Return type: int
-
stalled
¶ Return whether the connection is stalled.
Returns: bool
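A simplified meter illustrating the feed/speed cycle, with explicit timestamps so the arithmetic is visible (the real class keeps a rolling window of samples and minimum sample durations; `MiniBandwidthMeter` is illustrative only):

```python
class MiniBandwidthMeter:
    def __init__(self):
        self._samples = []   # (duration_seconds, byte_count) pairs
        self._last_time = None

    def feed(self, data_len, feed_time):
        if self._last_time is not None:
            self._samples.append((feed_time - self._last_time, data_len))
        self._last_time = feed_time

    def speed(self):
        # Average bytes per second over the recorded samples.
        total_time = sum(duration for duration, _ in self._samples)
        total_bytes = sum(count for _, count in self._samples)
        return total_bytes / total_time if total_time else 0

meter = MiniBandwidthMeter()
meter.feed(0, feed_time=0.0)     # first call only sets the reference time
meter.feed(1000, feed_time=1.0)  # 1000 bytes over 1 second
meter.feed(3000, feed_time=2.0)  # 3000 bytes over 1 second
assert meter.speed() == 2000.0
```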
network.connection
Module¶
Network connections.
-
class
wpull.network.connection.
BaseConnection
(address: tuple, hostname: typing.Union=None, timeout: typing.Union=None, connect_timeout: typing.Union=None, bind_host: typing.Union=None, sock: typing.Union=None)[source]¶ Bases:
object
Base network stream.
Parameters: - address – 2-item tuple containing the IP address and port or 4-item for IPv6.
- hostname – Hostname of the address (for SSL).
- timeout – Time in seconds before a read/write operation times out.
- connect_timeout – Time in seconds before a connect operation times out.
- bind_host – Host name for binding the socket interface.
- sock – Use given socket. The socket must already be connected.
-
reader
¶ Stream Reader instance.
-
writer
¶ Stream Writer instance.
-
address
¶ 2-item tuple containing the IP address.
-
host
¶ Host name.
-
port
¶ Port number.
-
class
wpull.network.connection.
CloseTimer
(timeout, connection)[source]¶ Bases:
object
Periodic timer to close connections if stalled.
-
class
wpull.network.connection.
Connection
(*args, bandwidth_limiter=None, **kwargs)[source]¶ Bases:
wpull.network.connection.BaseConnection
Network stream.
Parameters: bandwidth_limiter (bandwidth.BandwidthLimiter) – Bandwidth limiter for connection speed limiting.
-
key
¶ Value used by the ConnectionPool for its host pool map. Internal use only.
-
wrapped_connection
¶ A wrapped connection for ConnectionPool. Internal use only.
-
is_ssl
¶ bool
Whether connection is SSL.
-
proxied
¶ bool
Whether the connection is to an HTTP proxy.
-
tunneled
¶ bool
Whether the connection has been tunneled with the
CONNECT
request.
-
start_tls
(ssl_context: typing.Union=True) → 'SSLConnection'[source]¶ Start client TLS on this connection and return SSLConnection.
Coroutine
-
-
class
wpull.network.connection.
ConnectionState
[source]¶ Bases:
enum.Enum
State of a connection
-
ready
¶ Connection is ready to be used
-
created
¶ connect has been called successfully
-
dead
¶ Connection is closed
-
network.dns
Module¶
DNS resolution.
-
wpull.network.dns.
AddressInfo
¶ Socket address.
alias of
_AddressInfo
-
class
wpull.network.dns.
ResolveResult
(address_infos: typing.List, dns_infos: typing.Union=None)[source]¶ Bases:
object
DNS resolution information.
-
addresses
¶ The socket addresses.
-
dns_infos
¶ The DNS resource records.
-
first_ipv4
¶ The first IPv4 address.
-
first_ipv6
¶ The first IPV6 address.
-
-
class
wpull.network.dns.
Resolver
(family: wpull.network.dns.IPFamilyPreference=<IPFamilyPreference.any: 'any'>, timeout: typing.Union=None, bind_address: typing.Union=None, cache: typing.Union=None, rotate: bool=False)[source]¶ Bases:
wpull.application.hook.HookableMixin
Asynchronous resolver with cache and timeout.
Parameters: - family – IPv4 or IPv6 preference.
- timeout – A time in seconds used for timing-out requests. If not specified, this class relies on the underlying libraries.
- bind_address – An IP address to bind DNS requests if possible.
- cache – Cache to store results of any query.
- rotate – If the result is cached, rotate the results; otherwise, shuffle the results.
-
resolve
(host: str) → wpull.network.dns.ResolveResult[source]¶ Resolve hostname.
Parameters: host – Hostname.
Returns: Resolved IP addresses.
Raises: - DNSNotFound if the hostname could not be resolved or
- NetworkError if there was an error connecting to DNS servers.
Coroutine.
-
static
resolve_dns
(host: str) → str[source]¶ Resolve the hostname to an IP address.
Parameters: host – The hostname. This callback can be used to override the DNS lookup.
It is useful when the server is no longer available to the public. Typically, large infrastructures will change the DNS settings so that clients no longer hit the front-ends, but rather go towards a static HTTP server with a “We’ve been acqui-hired!” page. In these cases, the original servers may still be online.
Returns: None
to use the original behavior or a string containing an IP address or an alternate hostname.Return type: str, None
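As a sketch of how such an override might look (the hostname and address below are made up for illustration; 192.0.2.10 is a documentation-range address, not a real server):

```python
# Hypothetical resolve_dns override: map hostnames of defunct servers
# to known IP addresses, falling back to the normal lookup for
# everything else. The table contents are invented for illustration.
LEGACY_HOSTS = {
    'example.invalid': '192.0.2.10',
}


def resolve_dns(host: str):
    """Return an IP address string to override the lookup,
    or None to use the resolver's original behavior."""
    return LEGACY_HOSTS.get(host)
```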
network.pool
Module¶
-
class
wpull.network.pool.
ConnectionPool
(max_host_count: int=6, resolver: typing.Union=None, connection_factory: typing.Union=None, ssl_connection_factory: typing.Union=None, max_count: int=100)[source]¶ Bases:
object
Connection pool.
Parameters: - max_host_count – Number of connections per host.
- resolver – DNS resolver.
- connection_factory – A function that accepts
address
andhostname
arguments and returns aConnection
instance. - ssl_connection_factory – A function that returns a
SSLConnection
instance. See connection_factory. - max_count – Limit on number of connections
-
acquire
(host: str, port: int, use_ssl: bool=False, host_key: typing.Any=None) → wpull.network.connection.Connection[source]¶ Return an available connection.
Parameters: - host – A hostname or IP address.
- port – Port number.
- use_ssl – Whether to return an SSL connection.
- host_key – If provided, it overrides the key used for per-host connection pooling. This is useful for proxies, for example.
Coroutine.
-
clean
(force: bool=False)[source]¶ Clean all closed connections.
Parameters: force – Clean connected and idle connections too. Coroutine.
-
close
()[source]¶ Close all the connections and clean up.
This instance will not be usable after calling this method.
-
host_pools
¶
-
no_wait_release
(connection: wpull.network.connection.Connection)[source]¶ Synchronous version of
release()
.
-
class
wpull.network.pool.
HappyEyeballsConnection
(address, connection_factory, resolver, happy_eyeballs_table, is_ssl=False)[source]¶ Bases:
object
Wrapper for happy eyeballs connection.
-
class
wpull.network.pool.
HostPool
(connection_factory: typing.Callable, max_connections: int=6)[source]¶ Bases:
object
Connection pool for a host.
-
ready
¶ Queue
Connections not in use.
-
busy
¶ set
Connections in use.
-
acquire
() → wpull.network.connection.Connection[source]¶ Register and return a connection.
Coroutine.
-
clean
(force: bool=False)[source]¶ Clean closed connections.
Parameters: force – Clean connected and idle connections too. Coroutine.
-
observer
Module¶
Observer.
path
Module¶
File names and paths.
-
class
wpull.path.
PathNamer
(root, index='index.html', use_dir=False, cut=None, protocol=False, hostname=False, os_type='unix', no_control=True, ascii_only=True, case=None, max_filename_length=None)[source]¶ Bases:
wpull.path.BasePathNamer
Path namer that creates a directory hierarchy based on the URL.
Parameters: - root (str) – The base path.
- index (str) – The filename to use when the URL path does not indicate one.
- use_dir (bool) – Include directories based on the URL path.
- cut (int) – Number of leading directories to cut from the file path.
- protocol (bool) – Include the URL scheme in the directory structure.
- hostname (bool) – Include the hostname in the directory structure.
- safe_filename_args (dict) – Keyword arguments for safe_filename.
See also:
url_to_filename()
,url_to_dir_path()
,safe_filename()
.
-
class
wpull.path.
PercentEncoder
(unix=False, control=False, windows=False, ascii_=False)[source]¶ Bases:
collections.defaultdict
Percent encoder.
-
wpull.path.
anti_clobber_dir_path
(dir_path, suffix='.d')[source]¶ Return a directory path free of filenames.
Parameters: - dir_path (str) – A directory path.
- suffix (str) – The suffix to append to the part of the path that is a file.
Returns: str
-
wpull.path.
safe_filename
(filename, os_type='unix', no_control=True, ascii_only=True, case=None, encoding='utf8', max_length=None)[source]¶ Return a safe filename or path part.
Parameters: - filename (str) – The filename or path component.
- os_type (str) – If
unix
, escape the slash. Ifwindows
, escape extra Windows characters. - no_control (bool) – If True, escape control characters.
- ascii_only (bool) – If True, escape non-ASCII characters.
- case (str) – If
lower
, lowercase the string. Ifupper
, uppercase the string. - encoding (str) – The character encoding.
- max_length (int) – The maximum length of the filename.
This function assumes that filename has not already been percent-encoded.
Returns: str
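The behavior can be sketched roughly as follows. This is a simplified illustration of the idea, not wpull's actual implementation, and it covers only a subset of the options:

```python
# Simplified sketch of safe_filename(): percent-encode control
# characters, non-ASCII characters, and OS-reserved characters.
def sketch_safe_filename(filename: str, os_type: str = 'unix',
                         max_length: int = None) -> str:
    # Characters that must not appear literally in a path component.
    reserved = set('/') if os_type == 'unix' else set('/\\:*?"<>|')
    parts = []
    for char in filename:
        code = ord(char)
        if code < 32 or code > 126 or char in reserved or char == '%':
            # Encode each UTF-8 byte as %XX.
            parts.append(''.join('%{:02X}'.format(byte)
                                 for byte in char.encode('utf8')))
        else:
            parts.append(char)
    result = ''.join(parts)
    if max_length is not None:
        result = result[:max_length]
    return result
```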
-
wpull.path.
url_to_dir_parts
(url, include_protocol=False, include_hostname=False, alt_char=False)[source]¶ Return a list of directory parts from a URL.
Parameters: - url (str) – The URL.
- include_protocol (bool) – If True, the scheme from the URL will be included.
- include_hostname (bool) – If True, the hostname from the URL will be included.
- alt_char (bool) – If True, the character for the port delimiter
will be
+
instead of
:
This function does not include the filename and the paths are not sanitized.
Returns: list
-
wpull.path.
url_to_filename
(url, index='index.html', alt_char=False)[source]¶ Return a filename from a URL.
Parameters: - url (str) – The URL.
- index (str) – If a filename could not be derived from the URL path,
use index instead. For example,
/images/
will returnindex.html
- alt_char (bool) – If True, the character for the query delimiter
will be
@
instead of?
.
This function does not include the directories and does not sanitize the filename.
Returns: str
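A rough standard-library approximation of these two helpers (illustrative only; wpull's versions additionally handle query strings, ports, alternate delimiter characters, and sanitizing):

```python
from urllib.parse import urlsplit


def sketch_url_to_filename(url: str, index: str = 'index.html') -> str:
    # The filename is the last path segment; fall back to the index
    # name when the path ends with a slash.
    path = urlsplit(url).path
    name = path.rsplit('/', 1)[-1]
    return name or index


def sketch_url_to_dir_parts(url: str) -> list:
    # Drop the filename component and any empty segments.
    path = urlsplit(url).path
    return [part for part in path.split('/')[:-1] if part]
```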
pipeline
Module¶
pipeline.app
Module¶
-
class
wpull.pipeline.app.
AppSession
(factory: wpull.application.factory.Factory, args, stderr)[source]¶ Bases:
object
pipeline.item
Module¶
URL items.
-
class
wpull.pipeline.item.
LinkType
[source]¶ Bases:
enum.Enum
The type of contents that a link is expected to have.
-
css
= None¶ Stylesheet file. Recursion on links is usually safe.
-
directory
= None¶ FTP directory.
-
file
= None¶ FTP File.
-
html
= None¶ HTML document.
-
javascript
= None¶ JavaScript file. Possible to recurse links on this file.
-
media
= None¶ Image or video file. Recursion on this type will not be useful.
-
sitemap
= None¶ A Sitemap.xml file.
-
-
class
wpull.pipeline.item.
Status
[source]¶ Bases:
enum.Enum
URL status.
-
done
= None¶ The item has been processed successfully.
-
error
= None¶ The item encountered an error during processing.
-
in_progress
= None¶ The item is in progress of being processed.
-
skipped
= None¶ The item was excluded from processing due to some rejection filters.
-
todo
= None¶ The item has not yet been processed.
-
-
class
wpull.pipeline.item.
URLData
[source]¶ Bases:
wpull.pipeline.item.URLDatabaseMixin
Data associated fetching the URL.
- post_data (str): If given, the URL should be fetched as a
- POST request containing post_data.
-
database_attributes
= ('post_data',)¶
-
class
wpull.pipeline.item.
URLProperties
[source]¶ Bases:
wpull.pipeline.item.URLDatabaseMixin
URL properties that determine whether a URL is fetched.
-
parent_url
¶ str
The parent or referral URL that linked to this URL.
-
root_url
¶ str
The earliest ancestor URL of this URL. This URL is typically the URL supplied at the start of the program.
-
status
¶ Status
Processing status of this URL.
-
try_count
¶ int
The number of attempts on this URL.
-
level
¶ int
The recursive depth of this URL. A level of
0
indicates the URL was initially supplied to the program (the top URL). Level1
means the URL was linked from the top URL.
-
inline_level
¶ int
Whether this URL was an embedded object (such as an image or a stylesheet) of the parent URL.
The value represents the recursive depth of the object. For example, an iframe is depth 1 and the images in the iframe are depth 2.
-
link_type
¶ LinkType
Describes the expected document type.
-
database_attributes
= ('parent_url', 'root_url', 'status', 'try_count', 'level', 'inline_level', 'link_type', 'priority')¶
-
parent_url_info
¶ Return URL Info for the parent URL
-
root_url_info
¶ Return URL Info for the root URL
-
-
class
wpull.pipeline.item.
URLRecord
[source]¶ Bases:
wpull.pipeline.item.URLProperties
,wpull.pipeline.item.URLData
,wpull.pipeline.item.URLResult
An entry in the URL table describing a URL to be downloaded.
-
url
¶ str
The URL.
-
url_info
¶ Return URL Info for this URL
-
-
class
wpull.pipeline.item.
URLResult
[source]¶ Bases:
wpull.pipeline.item.URLDatabaseMixin
Data associated with the fetched URL.
status_code (int): The HTTP or FTP status code. filename (str): The path to where the file was saved.
-
database_attributes
= ('status_code', 'filename')¶
-
pipeline.pipeline
Module¶
-
class
wpull.pipeline.pipeline.
Pipeline
(item_source: wpull.pipeline.pipeline.ItemSource, tasks: typing.Sequence, item_queue: typing.Union=None)[source]¶ Bases:
object
-
concurrency
¶
-
tasks
¶
-
-
class
wpull.pipeline.pipeline.
PipelineSeries
(pipelines: typing.Iterator)[source]¶ Bases:
object
-
concurrency
¶
-
concurrency_pipelines
¶
-
pipelines
¶
-
pipeline.progress
Module¶
-
class
wpull.pipeline.progress.
BarProgress
(*args, draw_interval: float=0.5, bar_width: int=25, human_format: bool=True, **kwargs)[source]¶
-
class
wpull.pipeline.progress.
Progress
(stream: typing.IO=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='ANSI_X3.4-1968'>)[source]¶ Bases:
wpull.application.hook.HookableMixin
Print file download progress as dots or a bar.
Parameters: - bar_style (bool) – If True, print as a progress bar. If False, print dots every few seconds.
- stream – A file object. Default is usually stderr.
- human_format (bool) – If True, format sizes in units. Otherwise, output bits only.
-
class
wpull.pipeline.progress.
ProtocolProgress
(*args, **kwargs)[source]¶ Bases:
wpull.pipeline.progress.Progress
-
ProtocolProgress.
update_from_begin_request
(request: wpull.protocol.abstract.request.BaseRequest)[source]¶
-
ProtocolProgress.
update_from_begin_response
(response: wpull.protocol.abstract.request.BaseResponse)[source]¶
-
pipeline.session
Module¶
-
class
wpull.pipeline.session.
ItemSession
(app_session: wpull.pipeline.app.AppSession, url_record: wpull.pipeline.item.URLRecord)[source]¶ Bases:
object
Item for a URL that needs to be processed.
-
add_child_url
(url: str, inline: bool=False, link_type: typing.Union=None, post_data: typing.Union=None, level: typing.Union=None, replace: bool=False)[source]¶ Add links scraped from the document with automatic values.
Parameters: - url – A full URL. (It can’t be a relative path.)
- inline – Whether the URL is an embedded object.
- link_type – Expected link type.
- post_data – URL encoded form data. The request will be made using POST. (Don’t use this to upload files.)
- level – The child depth of this URL.
- replace – Whether to replace the existing entry in the database table so it will be downloaded again.
This function provides values automatically for:
inline
level
parent
: The referring page.root
See also
add_url()
.
-
child_url_record
(url: str, inline: bool=False, link_type: typing.Union=None, post_data: typing.Union=None, level: typing.Union=None)[source]¶ Return a child URLRecord.
This function is useful for testing filters before adding to the table.
-
is_processed
¶ Return whether the item has been processed.
-
is_virtual
¶
-
request
¶
-
response
¶
-
processor
Module¶
Item processing.
processor.base
Module¶
Base classes for processors.
-
class
wpull.processor.base.
BaseProcessor
[source]¶ Bases:
object
Base class for processors.
Processors contain the logic for processing requests.
-
class
wpull.processor.base.
BaseProcessorSession
[source]¶ Bases:
object
Base class for processor sessions.
-
wpull.processor.base.
REMOTE_ERRORS
= (<class 'wpull.errors.ServerError'>, <class 'wpull.errors.ProtocolError'>, <class 'wpull.errors.SSLVerificationError'>, <class 'wpull.errors.NetworkError'>)¶ List of error classes that are errors that occur with a server.
processor.coprocessor
Module¶
Additional processing not associated with Wget behavior.
processor.coprocessor.phantomjs
Module¶
PhantomJS page loading and scrolling.
-
class
wpull.processor.coprocessor.phantomjs.
PhantomJSCoprocessor
(phantomjs_driver_factory: typing.Callable, processing_rule: wpull.processor.rule.ProcessingRule, phantomjs_params: wpull.processor.coprocessor.phantomjs.PhantomJSParamsType, warc_recorder=None, root_path='.')[source]¶ Bases:
object
PhantomJS coprocessor.
Parameters: - phantomjs_driver_factory – Callback function that accepts
params
argument and returns PhantomJSDriver - processing_rule – Processing rule.
- warc_recorder – WARC recorder.
- root_dir (str) – Root directory path for temp files.
-
class
wpull.processor.coprocessor.phantomjs.
PhantomJSCoprocessorSession
(phantomjs_driver_factory, root_path, processing_rule, file_writer_session, request, response, item_session: wpull.pipeline.session.ItemSession, params, warc_recorder)[source]¶ Bases:
object
PhantomJS coprocessor session.
-
exception
wpull.processor.coprocessor.phantomjs.
PhantomJSCrashed
[source]¶ Bases:
Exception
PhantomJS exited with non-zero code.
-
wpull.processor.coprocessor.phantomjs.
PhantomJSParams
¶ PhantomJS parameters
-
wpull.processor.coprocessor.phantomjs.
snapshot_type
¶ list
File types. Accepted values are html, pdf, png, gif.
-
wpull.processor.coprocessor.phantomjs.
wait_time
¶ float
Time between page scrolls.
-
wpull.processor.coprocessor.phantomjs.
num_scrolls
¶ int
Maximum number of scrolls.
-
wpull.processor.coprocessor.phantomjs.
smart_scroll
¶ bool
Whether to stop scrolling if the number of requests and responses does not change.
-
wpull.processor.coprocessor.phantomjs.
snapshot
¶ bool
Whether to take snapshot files.
-
wpull.processor.coprocessor.phantomjs.
viewport_size
¶ tuple
Width and height of the page viewport.
-
wpull.processor.coprocessor.phantomjs.
paper_size
¶ tuple
Width and height of the paper size.
-
wpull.processor.coprocessor.phantomjs.
load_time
¶ float
Maximum time to wait for page load.
-
wpull.processor.coprocessor.phantomjs.
custom_headers
¶ dict
Default HTTP headers.
-
wpull.processor.coprocessor.phantomjs.
page_settings
¶ dict
Page settings.
alias of
PhantomJSParamsType
-
processor.coprocessor.proxy
Module¶
-
class
wpull.processor.coprocessor.proxy.
ProxyCoprocessor
(app_session: wpull.pipeline.app.AppSession)[source]¶ Bases:
object
Proxy coprocessor.
processor.coprocessor.youtubedl
Module¶
-
class
wpull.processor.coprocessor.youtubedl.
Session
(proxy_address, youtube_dl_path, root_path, item_session: wpull.pipeline.session.ItemSession, file_writer_session, user_agent, warc_recorder, inet_family, check_certificate)[source]¶ Bases:
object
youtube-dl session.
processor.delegate
Module¶
Delegation to other processor.
processor.ftp
Module¶
FTP
-
class
wpull.processor.ftp.
FTPProcessor
(ftp_client: wpull.protocol.ftp.client.Client, fetch_params)[source]¶ Bases:
wpull.processor.base.BaseProcessor
FTP processor.
Parameters: - ftp_client – The FTP client.
- fetch_params (
WebProcessorFetchParams
) – Parameters for fetching.
-
fetch_params
¶ The fetch parameters.
-
ftp_client
¶ The ftp client.
-
listing_cache
¶ Listing cache.
Returns: A cache mapping from URL to list of ftp.ls.listing.FileEntry
.
-
wpull.processor.ftp.
FTPProcessorFetchParams
¶ FTPProcessorFetchParams
Parameters: - remove_listing (bool) – Remove .listing files after fetching.
- glob (bool) – Enable URL globbing.
- preserve_permissions (bool) – Preserve file permissions.
- follow_symlinks (bool) – Follow symlinks.
alias of
FTPProcessorFetchParamsType
-
class
wpull.processor.ftp.
FTPProcessorSession
(processor: wpull.processor.ftp.FTPProcessor, item_session: wpull.pipeline.session.ItemSession)[source]¶ Bases:
wpull.processor.base.BaseProcessorSession
Fetches FTP files or directory listings.
-
exception
wpull.processor.ftp.
HookPreResponseBreak
[source]¶ Bases:
wpull.errors.ProtocolError
Hook pre-response break.
processor.rule
Module¶
Fetching rules.
-
class
wpull.processor.rule.
FetchRule
(url_filter: wpull.urlfilter.DemuxURLFilter=None, robots_txt_checker: wpull.protocol.http.robots.RobotsTxtChecker=None, http_login: typing.Union=None, ftp_login: typing.Union=None, duration_timeout: typing.Union=None)[source]¶ Bases:
wpull.application.hook.HookableMixin
Decide on what URLs should be fetched.
-
check_ftp_request
(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple¶ Check URL filters and scripting hook.
Returns: (bool, str) Return type: tuple
-
check_generic_request
(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple[source]¶ Check URL filters and scripting hook.
Returns: (bool, str) Return type: tuple
-
check_initial_web_request
(item_session: wpull.pipeline.session.ItemSession, request: wpull.protocol.http.request.Request) → typing.Tuple[source]¶ Check robots.txt, URL filters, and scripting hook.
Returns: (bool, str) Return type: tuple Coroutine.
-
check_subsequent_web_request
(item_session: wpull.pipeline.session.ItemSession, is_redirect: bool=False) → typing.Tuple[source]¶ Check URL filters and scripting hook.
Returns: (bool, str) Return type: tuple
-
consult_filters
(url_info: wpull.url.URLInfo, url_record: wpull.pipeline.item.URLRecord, is_redirect: bool=False) → typing.Tuple[source]¶ Consult the URL filter.
Parameters: - url_record – The URL record.
- is_redirect – Whether the request is a redirect that is intended to span hosts.
Returns: tuple:
- bool: The verdict
- str: A short reason string: nofilters, filters, redirect
- dict: The result from
DemuxURLFilter.test_info()
-
consult_hook
(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reason: str, test_info: dict)[source]¶ Consult the scripting hook.
Returns: (bool, str) Return type: tuple
-
consult_robots_txt
(request: wpull.protocol.http.request.Request) → bool[source]¶ Consult robots.txt, fetching it as needed.
Parameters: request – The request to be made to get the file. Returns: True if the URL can be fetched Coroutine
-
classmethod
is_only_span_hosts_failed
(test_info: dict) → bool[source]¶ Return whether only the SpanHostsFilter failed.
-
static
plugin_accept_url
(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reasons: dict) → bool[source]¶ Return whether to download this URL.
Parameters: - item_session – Current URL item.
- verdict – A bool indicating whether Wpull wants to download the URL.
- reasons –
A dict containing information for the verdict:
filters
(dict): A mapping (str to bool) from filter name to whether the filter passed or not.reason
(str): A short reason string. Current values are:filters
,robots
,redirect
.
Returns: If
True
, the URL should be downloaded. Otherwise, the URL is skipped.
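A hypothetical hook in this shape might look like the following sketch; the filter name 'SpanHostsFilter' is used only as an example key in the reasons dict:

```python
# Hypothetical accept_url hook: honor wpull's verdict, except rescue
# URLs that were rejected solely because a single example filter
# ('SpanHostsFilter') failed. Illustrative only.
def accept_url(verdict: bool, reasons: dict) -> bool:
    if verdict:
        return True
    failed = [name for name, passed in reasons['filters'].items()
              if not passed]
    # Accept when spanning hosts was the only objection.
    return failed == ['SpanHostsFilter']
```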
-
-
class
wpull.processor.rule.
ProcessingRule
(fetch_rule: wpull.processor.rule.FetchRule, document_scraper: wpull.scraper.base.DemuxDocumentScraper=None, sitemaps: bool=False, url_rewriter: wpull.urlrewrite.URLRewriter=None)[source]¶ Bases:
wpull.application.hook.HookableMixin
Document processing rules.
Parameters: - fetch_rule – The FetchRule instance.
- document_scraper – The document scraper.
-
add_extra_urls
(item_session: wpull.pipeline.session.ItemSession)[source]¶ Add additional URLs such as robots.txt, favicon.ico.
-
static
parse_url
(url, encoding='utf-8')¶ Parse and return a URLInfo.
This function logs a warning if the URL cannot be parsed and returns None.
-
static
plugin_get_urls
(item_session: wpull.pipeline.session.ItemSession)[source]¶ Add additional URLs to be added to the URL Table.
When this event is dispatched, the caller should add any URLs needed using
ItemSession.add_child_url()
.
-
class
wpull.processor.rule.
ResultRule
(ssl_verification: bool=False, retry_connrefused: bool=False, retry_dns_error: bool=False, waiter: typing.Union=None, statistics: typing.Union=None)[source]¶ Bases:
wpull.application.hook.HookableMixin
Decide on the results of a fetch.
Parameters: - ssl_verification – If True, don’t ignore certificate errors.
- retry_connrefused – If True, don’t consider a connection refused error to be a permanent error.
- retry_dns_error – If True, don’t consider a DNS resolution error to be a permanent error.
- waiter – The Waiter.
- statistics – The Statistics.
-
consult_error_hook
(item_session: wpull.pipeline.session.ItemSession, error: BaseException)[source]¶ Return scripting action when an error occurred.
-
consult_pre_response_hook
(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Return scripting action when a response begins.
-
consult_response_hook
(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Return scripting action when a response ends.
-
get_wait_time
(item_session: wpull.pipeline.session.ItemSession, error=None)[source]¶ Return the wait time in seconds between requests.
-
handle_document
(item_session: wpull.pipeline.session.ItemSession, filename: str) → wpull.application.hook.Actions[source]¶ Process a successful document response.
Returns: A value from hook.Actions
.
-
handle_document_error
(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Callback for when the document only describes a server error.
Returns: A value from hook.Actions
.
-
handle_error
(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions[source]¶ Process an error.
Returns: A value from hook.Actions
.
-
handle_intermediate_response
(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Callback for successful intermediate responses.
Returns: A value from hook.Actions
.
-
handle_no_document
(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Callback for successful responses containing no useful document.
Returns: A value from hook.Actions
.
-
handle_pre_response
(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Process a response that is starting.
-
handle_response
(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Generic handler for a response.
Returns: A value from hook.Actions
.
-
static
plugin_handle_error
(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions[source]¶ Return an action to handle the error.
Parameters: - item_session –
- error –
Returns: A value from
Actions
. The default isActions.NORMAL
.
-
static
plugin_handle_pre_response
(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Return an action to handle a response status before a download.
Parameters: item_session – Returns: A value from Actions
. The default isActions.NORMAL
.
processor.web
Module¶
Web processing.
-
exception
wpull.processor.web.
HookPreResponseBreak
[source]¶ Bases:
wpull.errors.ProtocolError
Hook pre-response break.
-
class
wpull.processor.web.
WebProcessor
(web_client: wpull.protocol.http.web.WebClient, fetch_params: wpull.processor.web.WebProcessorFetchParamsType)[source]¶ Bases:
wpull.processor.base.BaseProcessor
,wpull.application.hook.HookableMixin
HTTP processor.
Parameters: - web_client – The web client.
- fetch_params – Fetch parameters
See also
-
DOCUMENT_STATUS_CODES
= (200, 204, 206, 304)¶ Default status codes considered successfully fetching a document.
-
NO_DOCUMENT_STATUS_CODES
= (401, 403, 404, 405, 410)¶ Default status codes considered a permanent error.
-
fetch_params
¶ The fetch parameters.
-
web_client
¶ The web client.
-
wpull.processor.web.
WebProcessorFetchParams
¶ WebProcessorFetchParams
Parameters: - post_data (str) – If provided, all requests will be POSTed with the given post_data. post_data must be in percent-encoded query format (“application/x-www-form-urlencoded”).
- strong_redirects (bool) – If True, redirects are allowed to span hosts.
alias of
WebProcessorFetchParamsType
-
class
wpull.processor.web.
WebProcessorSession
(processor: wpull.processor.web.WebProcessor, item_session: wpull.pipeline.session.ItemSession)[source]¶ Bases:
wpull.processor.base.BaseProcessorSession
Fetches an HTTP document.
This Processor Session will handle document redirects within the same Session. HTTP errors such as 404 are considered permanent errors. HTTP errors like 500 are considered transient errors and are handled in subsequent sessions by marking the item as “error”.
If a successful document has been downloaded, it will be scraped for URLs to be added to the URL table. This Processor Session is very simple; it cannot handle JavaScript or Flash plugins.
protocol
Module¶
protocol.abstract
Module¶
Conversation abstractions.
protocol.abstract.client
Module¶
Client abstractions
-
class
wpull.protocol.abstract.client.
BaseClient
(connection_pool: typing.Union=None)[source]¶ Bases:
typing.Generic
,wpull.application.hook.HookableMixin
Base client.
-
class
wpull.protocol.abstract.client.
BaseSession
(connection_pool)[source]¶ Bases:
wpull.application.hook.HookableMixin
Base session.
-
exception
wpull.protocol.abstract.client.
DurationTimeout
[source]¶ Bases:
wpull.errors.NetworkTimedOut
Download did not complete within specified time.
protocol.abstract.request
Module¶
Request object abstractions
-
class
wpull.protocol.abstract.request.
BaseResponse
[source]¶ Bases:
wpull.protocol.abstract.request.ProtocolResponseMixin
-
class
wpull.protocol.abstract.request.
ProtocolResponseMixin
[source]¶ Bases:
object
Protocol abstraction for response objects.
-
protocol
¶ Name of the protocol.
Returns: Either ftp
orhttp
.Return type: str
-
protocol.abstract.stream
Module¶
Abstract stream classes
protocol.ftp
Module¶
File transfer protocol.
protocol.ftp.client
Module¶
FTP client.
-
class
wpull.protocol.ftp.client.
Client
(*args, **kwargs)[source]¶ Bases:
wpull.protocol.abstract.client.BaseClient
FTP Client.
The session object is
Session
.
-
class
wpull.protocol.ftp.client.
Session
(login_table: weakref.WeakKeyDictionary, **kwargs)[source]¶ Bases:
wpull.protocol.abstract.client.BaseSession
-
Session.
download
(file: typing.Union=None, rewind: bool=True, duration_timeout: typing.Union=None) → wpull.protocol.ftp.request.Response[source]¶ Read the response content into file.
Parameters: - file – A file object or asyncio stream.
- rewind – Seek the given file back to its original offset after reading is finished.
- duration_timeout – Maximum time in seconds in which the entire file must be read.
Returns: A Response populated with the final data connection reply.
Be sure to call
start()
first.Coroutine.
-
Session.
download_listing
(file: typing.Union, duration_timeout: typing.Union=None) → wpull.protocol.ftp.request.ListingResponse[source]¶ Read file listings.
Parameters: - file – A file object or asyncio stream.
- duration_timeout – Maximum time in seconds in which the entire file must be read.
Returns: A Response populated with the file listings
Be sure to call
start_file_listing()
first.Coroutine.
-
Session.
start
(request: wpull.protocol.ftp.request.Request) → wpull.protocol.ftp.request.Response[source]¶ Start a file or directory listing download.
Parameters: request – Request. Returns: A Response populated with the initial data connection reply. Once the response is received, call
download()
.Coroutine.
-
Session.
start_listing
(request: wpull.protocol.ftp.request.Request) → wpull.protocol.ftp.request.ListingResponse[source]¶ Fetch a file listing.
Parameters: request – Request. Returns: A listing response populated with the initial data connection reply. Once the response is received, call
download_listing()
.Coroutine.
-
protocol.ftp.command
Module¶
FTP service control.
-
class
wpull.protocol.ftp.command.
Commander
(data_stream)[source]¶ Bases:
object
Helper class that performs typical FTP routines.
Parameters: control_stream ( ftp.stream.ControlStream
) – The control stream.-
begin_stream
(command: wpull.protocol.ftp.request.Command) → wpull.protocol.ftp.request.Reply[source]¶ Start sending content on the data stream.
Parameters: command – A command that tells the server to send data over the data connection.
Coroutine.
Returns: The begin reply.
-
passive_mode
() → typing.Tuple[source]¶ Enable passive mode.
Returns: The address (IP address, port) of the passive port. Coroutine.
-
classmethod
raise_if_not_match
(action: str, expected_code: typing.Union, reply: wpull.protocol.ftp.request.Reply)[source]¶ Raise FTPServerError if the reply code does not match the expected code.
Parameters: - action – Label to use in the exception message.
- expected_code – Expected 3 digit code.
- reply – Reply from the server.
-
read_stream
(file: typing.IO, data_stream: wpull.protocol.ftp.stream.DataStream) → wpull.protocol.ftp.request.Reply[source]¶ Read from the data stream.
Parameters: - file – A destination file object or a stream writer.
- data_stream – The stream to read from.
Coroutine.
Returns: The final reply. Return type: Reply
-
setup_data_stream
(connection_factory: typing.Callable, data_stream_factory: typing.Callable=<class 'wpull.protocol.ftp.stream.DataStream'>) → wpull.protocol.ftp.stream.DataStream[source]¶ Create and set up a data stream.
This function will set up passive and binary mode and handle connecting to the data connection.
Parameters: - connection_factory – A coroutine callback that returns a connection
- data_stream_factory – A callback that returns a data stream
Coroutine.
Returns: DataStream
-
protocol.ftp.ls
Module¶
I-tried-my-best LIST parsing package.
protocol.ftp.ls.date
Module¶
Date and time parsing
-
wpull.protocol.ftp.ls.date.
AM_STRINGS
= {'vorm', 'पूर्व', 'a. m', 'am', '午前', '上午', 'ص'}¶ Set of AM day period strings.
-
wpull.protocol.ftp.ls.date.
DAY_PERIOD_PATTERN
= re.compile('(nachm|vorm|م|पूर्व|a. m|午後|अपर|pm|下午|am|p. m|午前|上午|ص)\\b', re.IGNORECASE)¶ Regex pattern for AM/PM string.
-
wpull.protocol.ftp.ls.date.
ISO_8601_DATE_PATTERN
= re.compile('(\\d{4})(?!\\d)[\\w./-](\\d{1,2})(?!\\d)[\\w./-](\\d{1,2})')¶ Regex pattern for dates similar to YYYY-MM-DD.
-
wpull.protocol.ftp.ls.date.
MMM_DD_YY_PATTERN
= re.compile('([^\\W\\d_]{3,4})\\s{0,4}(\\d{1,2})\\s{0,4}(\\d{0,4})')¶ Regex pattern for dates similar to MMM DD YY.
Example: Feb 09 90
-
wpull.protocol.ftp.ls.date.
MONTH_MAP
= {'أبريل': 4, 'juni': 6, 'set': 9, 'lis': 11, 'juil': 7, 'lip': 7, 'aug': 8, 'sie': 8, 'जन': 1, 'يوليو': 7, 'नवं': 11, 'lut': 2, 'oct': 10, '7月': 7, 'juin': 6, 'فبراير': 2, '3月': 3, 'dec': 12, 'मार्च': 3, 'अक्टू': 10, 'sty': 1, 'जुला': 7, 'juli': 7, 'أكتوبر': 10, 'марта': 3, 'jan': 1, 'янв': 1, 'нояб': 11, 'ديسمبر': 12, 'apr': 4, 'अग': 8, 'août': 8, 'ago': 8, 'июня': 6, 'окт': 10, 'févr': 2, 'मई': 5, '8月': 8, 'ene': 1, 'сент': 9, 'نوفمبر': 11, '9月': 9, 'nov': 11, '5月': 5, '10月': 10, 'jul': 7, 'يناير': 1, 'जून': 6, 'mars': 3, 'déc': 12, 'dez': 12, 'dic': 12, 'okt': 10, 'апр': 4, 'avr': 4, 'mai': 5, 'gru': 12, '6月': 6, 'июля': 7, '12月': 12, 'wrz': 9, 'out': 10, 'авг': 8, 'फ़र': 2, 'мая': 5, 'февр': 2, 'سبتمبر': 9, 'feb': 2, 'अप्रै': 4, 'maj': 5, 'fev': 2, 'مارس': 3, '1月': 1, 'may': 5, 'mar': 3, '4月': 4, 'jun': 6, 'दिसं': 12, 'paź': 10, 'sep': 9, 'kwi': 4, '11月': 11, '2月': 2, 'abr': 4, 'सितं': 9, 'märz': 3, 'مايو': 5, 'أغسطس': 8, 'sept': 9, 'janv': 1, 'дек': 12, 'cze': 6, 'يونيو': 6}¶ Month names to int.
-
wpull.protocol.ftp.ls.date.
NN_NN_NNNN_PATTERN
= re.compile('(\\d{1,2})[./-](\\d{1,2})[./-](\\d{2,4})')¶ Regex pattern for dates similar to NN NN YYYY.
Example: 2/9/90
-
wpull.protocol.ftp.ls.date.
PM_STRINGS
= {'nachm', 'م', '午後', 'अपर', 'pm', '下午', 'p. m'}¶ Set of PM day period strings.
-
wpull.protocol.ftp.ls.date.
TIME_PATTERN
= re.compile('(\\d{1,2}):(\\d{2}):?(\\d{0,2})\\s?(nachm|vorm|م|पूर्व|a. m|午後|अपर|pm|下午|am|p. m|午前|上午|ص|\x08)?')¶ Regex pattern for time in HH:MM[:SS]
-
wpull.protocol.ftp.ls.date.
guess_datetime_format
(lines: typing.Iterable, threshold: int=5) → typing.Tuple[source]¶ Guess the order of year, month, and day, and whether times are 12-hour or 24-hour.
Returns: First item is either str ymd
,dmy
,mdy
orNone
. Second item is either True for 12-hour time or False for 24-hour time or None.Return type: tuple
-
wpull.protocol.ftp.ls.date.
parse_cldr_json
(directory, language_codes=('zh', 'es', 'en', 'hi', 'ar', 'pt', 'ru', 'ja', 'de', 'fr', 'pl'), massage=True)[source]¶ Parse CLDR JSON datasets for date and time strings.
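The date patterns documented above can be exercised directly with the standard `re` module. A minimal sketch that rebuilds two of them from their reprs (reconstructions for illustration, not imports from Wpull):

```python
import re

# Rebuilt from the pattern reprs documented above; illustrative only.
NN_NN_NNNN_PATTERN = re.compile(r'(\d{1,2})[./-](\d{1,2})[./-](\d{2,4})')
MMM_DD_YY_PATTERN = re.compile(r'([^\W\d_]{3,4})\s{0,4}(\d{1,2})\s{0,4}(\d{0,4})')

print(NN_NN_NNNN_PATTERN.search('2/9/90').groups())    # ('2', '9', '90')
print(MMM_DD_YY_PATTERN.search('Feb 09 90').groups())  # ('Feb', '09', '90')
```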
protocol.ftp.ls.listing
Module¶
Listing parser.
-
wpull.protocol.ftp.ls.listing.
FileEntry
¶ A row in a listing.
-
wpull.protocol.ftp.ls.listing.
name
¶ str
Filename.
-
wpull.protocol.ftp.ls.listing.
type
¶ str, None
file
,dir
,symlink
,other
,None
-
wpull.protocol.ftp.ls.listing.
size
¶ int, None
Size of file.
-
wpull.protocol.ftp.ls.listing.
date
¶ datetime.datetime
, NoneA datetime object in UTC.
-
wpull.protocol.ftp.ls.listing.
dest
¶ str, None
Destination filename for symlinks.
-
wpull.protocol.ftp.ls.listing.
perm
¶ int, None
Unix permissions expressed as an integer.
alias of
FileEntryType
-
-
class
wpull.protocol.ftp.ls.listing.
LineParser
[source]¶ Bases:
object
Parse individual lines in a listing.
-
exception
wpull.protocol.ftp.ls.listing.
ListingError
[source]¶ Bases:
ValueError
Error during parsing a listing.
-
class
wpull.protocol.ftp.ls.listing.
ListingParser
(text=None, file=None)[source]¶ Bases:
wpull.protocol.ftp.ls.listing.LineParser
Listing parser.
Parameters: - text (str) – A text listing.
- file – A file object in text mode containing the listing.
-
exception
wpull.protocol.ftp.ls.listing.
UnknownListingError
[source]¶ Bases:
wpull.protocol.ftp.ls.listing.ListingError
Failed to determine type of listing.
-
wpull.protocol.ftp.ls.listing.
guess_listing_type
(lines, threshold=100)[source]¶ Guess the style of directory listing.
Returns: unix
,msdos
,nlst
,unknown
.Return type: str
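As a rough illustration of what this guess involves, here is a hypothetical sketch (not Wpull's implementation, which samples many lines against a threshold) that distinguishes the three common styles by their leading fields:

```python
import re

# Hypothetical detectors for the listing styles named above.
UNIX_ROW = re.compile(r'^[-dl][rwxsStT-]{9}\s')   # e.g. "-rw-r--r-- ..."
MSDOS_ROW = re.compile(r'^\d{2}-\d{2}-\d{2,4}\s')  # e.g. "02-09-1990 ..."

def guess_listing_type(lines):
    if any(UNIX_ROW.match(line) for line in lines):
        return 'unix'
    if any(MSDOS_ROW.match(line) for line in lines):
        return 'msdos'
    if lines:
        return 'nlst'  # bare filenames only
    return 'unknown'
```

An `ls -l` style row such as `-rw-r--r-- 1 owner group 4096 Feb 09 1990 readme.txt` takes the unix branch.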
protocol.ftp.request
Module¶
FTP conversation classes
-
class
wpull.protocol.ftp.request.
Command
(name=None, argument='')[source]¶ Bases:
wpull.protocol.abstract.request.SerializableMixin
,wpull.protocol.abstract.request.DictableMixin
FTP request command.
Encoding is UTF-8.
-
name
¶ str
The command. Usually 4 characters or less.
-
argument
¶ str
Optional argument for the command.
-
name
-
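For illustration, a command like the one above serializes to a single UTF-8 line on the control connection. A hedged sketch (hypothetical helper name, not the class's real serializer):

```python
# Hypothetical sketch of FTP command serialization:
# one line, UTF-8 encoded, terminated by CRLF.
def serialize_command(name, argument=''):
    if argument:
        line = '{} {}\r\n'.format(name.upper(), argument)
    else:
        line = '{}\r\n'.format(name.upper())
    return line.encode('utf-8')
```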
-
class
wpull.protocol.ftp.request.
ListingResponse
[source]¶ Bases:
wpull.protocol.ftp.request.Response
FTP response for a file listing.
-
files
¶ list
A list of
ftp.ls.listing.FileEntry
-
-
class
wpull.protocol.ftp.request.
Reply
(code=None, text=None)[source]¶ Bases:
wpull.protocol.abstract.request.SerializableMixin
,wpull.protocol.abstract.request.DictableMixin
FTP reply.
Encoding is always UTF-8.
-
code
¶ int
Reply code.
-
text
¶ str
Reply message.
-
-
class
wpull.protocol.ftp.request.
Request
(url)[source]¶ Bases:
wpull.protocol.abstract.request.BaseRequest
,wpull.protocol.abstract.request.URLPropertyMixin
FTP request for a file.
-
address
¶ tuple
Address of control connection.
-
data_address
¶ tuple
Address of data connection.
-
username
¶ str, None
Username for login.
-
password
¶ str, None
Password for login.
-
restart_value
¶ int, None
Optional value for
REST
command.
-
file_path
¶ str
Path of the file.
-
file_path
-
-
class
wpull.protocol.ftp.request.
Response
[source]¶ Bases:
wpull.protocol.abstract.request.BaseResponse
,wpull.protocol.abstract.request.DictableMixin
FTP response for a file.
-
file_transfer_size
¶ int
Size of the file transfer without considering restart. (REST is issued last.)
This will be the file size. (STREAM mode is always used.)
-
restart_value
¶ int
Offset value of restarted transfer.
-
protocol
¶
-
protocol.ftp.stream
Module¶
FTP Streams
-
class
wpull.protocol.ftp.stream.
ControlStream
(connection: wpull.network.connection.Connection)[source]¶ Bases:
object
Stream class for a control connection.
Parameters: connection – Connection. -
data_event_dispatcher
¶
-
read_reply
() → wpull.protocol.ftp.request.Reply[source]¶ Read a reply from the stream.
Returns: The reply Return type: ftp.request.Reply Coroutine.
-
protocol.ftp.util
Module¶
Utils
-
exception
wpull.protocol.ftp.util.
FTPServerError
[source]¶ Bases:
wpull.errors.ServerError
-
reply_code
¶ Return reply code.
-
-
class
wpull.protocol.ftp.util.
ReplyCodes
[source]¶ Bases:
object
-
bad_sequence_of_commands
= 503¶
-
cant_open_data_connection
= 425¶
-
closing_data_connection
= 226¶
-
command_not_implemented
= 502¶
-
command_not_implemented_for_that_parameter
= 504¶
-
command_not_implemented_superfluous_at_this_site
= 202¶
-
command_okay
= 200¶
-
connection_closed_transfer_aborted
= 426¶
-
data_connection_already_open_transfer_starting
= 125¶
-
data_connection_open_no_transfer_in_progress
= 225¶
-
directory_status
= 212¶
-
entering_passive_mode
= 227¶
-
file_status
= 213¶
-
file_status_okay_about_to_open_data_connection
= 150¶
-
help_message
= 214¶
-
name_system_type
= 215¶
-
need_account_for_login
= 332¶
-
need_account_for_storing_files
= 532¶
-
not_logged_in
= 530¶
-
pathname_created
= 257¶
-
requested_action_aborted_local_error_in_processing
= 451¶
-
requested_action_aborted_page_type_unknown
= 551¶
-
requested_action_not_taken_file_name_not_allowed
= 553¶
-
requested_action_not_taken_insufficient_storage_space
= 452¶
-
requested_file_action_aborted
= 552¶
-
requested_file_action_not_taken
= 450¶
-
requested_file_action_okay_completed
= 250¶
-
requested_file_action_pending_further_information
= 350¶
-
restart_marker_reply
= 110¶
-
service_closing_control_connection
= 221¶
-
service_not_available_closing_control_connection
= 421¶
-
service_ready_for_new_user
= 220¶
-
service_ready_in_nnn_minutes
= 120¶
-
syntax_error_command_unrecognized
= 500¶
-
syntax_error_in_parameters_or_arguments
= 501¶
-
system_status_or_system_help_reply
= 211¶
-
user_logged_in_proceed
= 230¶
-
user_name_okay_need_password
= 331¶
-
-
wpull.protocol.ftp.util.
convert_machine_list_time_val
(text: str) → datetime.datetime[source]¶ Convert RFC 3659 time-val to datetime objects.
-
wpull.protocol.ftp.util.
convert_machine_list_value
(name: str, value: str) → typing.Union[source]¶ Convert sizes and time values.
Size will be
int
while time value will bedatetime.datetime
.
-
wpull.protocol.ftp.util.
machine_listings_to_file_entries
(listings: typing.Iterable) → typing.Iterable[source]¶ Convert results from parsing machine listings to FileEntry list.
-
wpull.protocol.ftp.util.
parse_machine_listing
(text: str, convert: bool=True, strict: bool=True) → typing.List[source]¶ Parse machine listing.
Parameters: - text – The listing.
- convert – Convert sizes and dates.
- strict – Method of handling errors.
True
will raiseValueError
.False
will ignore rows with errors.
Returns: A list of dict of the facts defined in RFC 3659. The key names must be lowercase. The filename uses the key
name
.Return type: list
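A hedged sketch of the fact syntax this function parses: RFC 3659 rows consist of `fact=value;` pairs followed by a space and the filename. This is a simplified stand-in, not Wpull's code:

```python
def parse_facts_line(line):
    # "type=file;size=1024; name.txt" -> dict with lowercase fact keys;
    # the filename is stored under the 'name' key, as documented above.
    facts_part, _, filename = line.partition(' ')
    entry = {'name': filename}
    for fact in facts_part.rstrip(';').split(';'):
        key, _, value = fact.partition('=')
        entry[key.lower()] = value
    return entry
```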
protocol.http
Module¶
HTTP Protocol.
protocol.http.chunked
Module¶
Chunked transfer encoding.
-
class
wpull.protocol.http.chunked.
ChunkedTransferReader
(connection, read_size=4096)[source]¶ Bases:
object
Read chunked transfer encoded stream.
Parameters: connection ( connection.Connection
) – Established connection.-
read_chunk_body
()[source]¶ Read a fragment of a single chunk.
Call
read_chunk_header()
first.Returns: A 2-item tuple with the content data and raw data. The first item is an empty bytes string when the chunk is fully read. Return type: tuple Coroutine.
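For reference, the wire format being read looks like this. A toy decoder for a fully buffered body (the reader above works incrementally instead):

```python
def decode_chunked(data):
    # Each chunk: hex size line, CRLF, payload, CRLF; a zero size ends it.
    content = b''
    pos = 0
    while True:
        line_end = data.index(b'\r\n', pos)
        size = int(data[pos:line_end].split(b';')[0], 16)
        if size == 0:
            return content
        start = line_end + 2
        content += data[start:start + size]
        pos = start + size + 2  # skip the chunk's trailing CRLF

print(decode_chunked(b'5\r\nhello\r\n6\r\n world\r\n0\r\n\r\n'))  # b'hello world'
```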
-
protocol.http.client
Module¶
Basic HTTP Client.
-
class
wpull.protocol.http.client.
Client
(*args, stream_factory=<class 'wpull.protocol.http.stream.Stream'>, **kwargs)[source]¶ Bases:
wpull.protocol.abstract.client.BaseClient
Stateless HTTP/1.1 client.
The session object is
Session
.
-
class
wpull.protocol.http.client.
Session
(stream_factory: typing.Callable=None, **kwargs)[source]¶ Bases:
wpull.protocol.abstract.client.BaseSession
HTTP request and response session.
-
Session.
done
() → bool[source]¶ Return whether the session is complete.
A session is complete when it has sent a request, read the response header and the response body.
-
Session.
download
(file: typing.Union=None, raw: bool=False, rewind: bool=True, duration_timeout: typing.Union=None)[source]¶ Read the response content into file.
Parameters: - file – A file object or asyncio stream.
- raw – Whether chunked transfer encoding should be included.
- rewind – Seek the given file back to its original offset after reading is finished.
- duration_timeout – Maximum time in seconds within which the entire file must be read.
Be sure to call
start()
first.Coroutine.
-
Session.
start
(request: wpull.protocol.http.request.Request) → wpull.protocol.http.request.Response[source]¶ Begin an HTTP request
Parameters: request – Request information. Returns: A response populated with the HTTP headers. Once the headers are received, call
download()
.Coroutine.
-
protocol.http.redirect
Module¶
Redirection tracking.
-
class
wpull.protocol.http.redirect.
RedirectTracker
(max_redirects=20, codes=(301, 302, 303), repeat_codes=(307, 308))[source]¶ Bases:
object
Keeps track of HTTP document URL redirects.
Parameters: - max_redirects (int) – The maximum number of redirects to allow.
- codes – The HTTP status codes indicating a redirect where the method can change to “GET”.
- repeat_codes – The HTTP status codes indicating a redirect where the method cannot change and future requests should be repeated.
-
REDIRECT_CODES
= (301, 302, 303)¶
-
REPEAT_REDIRECT_CODES
= (307, 308)¶
-
load
(response)[source]¶ Load the response and increment the counter.
Parameters: response ( http.request.Response
) – The response from a previous request.
-
next_location
(raw=False)[source]¶ Returns the next location.
Parameters: raw (bool) – If True, the original string contained in the Location field will be returned. Otherwise, the URL will be normalized to a complete URL. Returns: The next location if available, otherwise None. Return type: str, None
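The split between the two code tuples can be illustrated with a small sketch of the method rule they imply (an illustrative helper, not part of the class):

```python
REDIRECT_CODES = (301, 302, 303)    # method may change to GET
REPEAT_REDIRECT_CODES = (307, 308)  # method must be repeated

def next_method(status_code, original_method):
    if status_code in REDIRECT_CODES:
        return 'GET'
    if status_code in REPEAT_REDIRECT_CODES:
        return original_method
    return None  # not a redirect
```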
protocol.http.request
Module¶
HTTP conversation objects.
-
class
wpull.protocol.http.request.
RawRequest
(method=None, resource_path=None, version='HTTP/1.1')[source]¶ Bases:
wpull.protocol.abstract.request.BaseRequest
,wpull.protocol.abstract.request.SerializableMixin
,wpull.protocol.abstract.request.DictableMixin
Represents an HTTP request.
-
method
¶ str
The HTTP method in the status line. For example,
GET
,POST
.
-
resource_path
¶ str
The URL or “path” in the status line.
-
version
¶ str
The HTTP version in the status line. For example,
HTTP/1.0
.
-
fields
¶ -
The fields in the HTTP header.
-
encoding
¶ str
The encoding of the status line.
-
-
class
wpull.protocol.http.request.
Request
(url=None, method='GET', version='HTTP/1.1')[source]¶ Bases:
wpull.protocol.http.request.RawRequest
Represents a higher level of HTTP request.
-
address
¶ tuple
An address tuple suitable for
socket.connect()
.
-
username
¶ str
Username for HTTP authentication.
-
password
¶ str
Password for HTTP authentication.
-
-
class
wpull.protocol.http.request.
Response
(status_code=None, reason=None, version='HTTP/1.1', request=None)[source]¶ Bases:
wpull.protocol.abstract.request.BaseResponse
,wpull.protocol.abstract.request.SerializableMixin
,wpull.protocol.abstract.request.DictableMixin
Represents the HTTP response.
-
status_code
¶ int
The status code in the status line.
-
status_reason
¶ str
The status reason string in the status line.
-
version
¶ str
The HTTP version in the status line. For example,
HTTP/1.1
.
-
fields
¶ -
The fields in the HTTP headers (and trailer, if present).
-
request
¶ The corresponding request.
-
encoding
¶ str
The encoding of the status line.
-
classmethod
parse_status_line
(data)[source]¶ Parse the status line bytes.
Returns: A tuple representing the version, code, and reason. Return type: tuple
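A hedged sketch of what parsing a status line involves (the real classmethod also deals with encoding detection of the raw bytes):

```python
def parse_status_line(data):
    # b'HTTP/1.1 404 Not Found\r\n' -> ('HTTP/1.1', 404, 'Not Found')
    version, _, remainder = data.decode('latin1').strip().partition(' ')
    code, _, reason = remainder.partition(' ')
    return version, int(code), reason
```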
-
protocol
¶
-
protocol.http.robots
Module¶
Robots.txt file logistics.
-
exception
wpull.protocol.http.robots.
NotInPoolError
[source]¶ Bases:
Exception
The URL is not in the pool.
-
class
wpull.protocol.http.robots.
RobotsTxtChecker
(web_client: wpull.protocol.http.web.WebClient=None, robots_txt_pool: wpull.robotstxt.RobotsTxtPool=None)[source]¶ Bases:
object
Robots.txt file fetcher and checker.
Parameters: - web_client – Web Client.
- robots_txt_pool – Robots.txt Pool.
-
can_fetch
(request: wpull.protocol.http.request.Request, file=None) → bool[source]¶ Return whether the request can be fetched.
Parameters: - request – Request.
- file – A file object to where the robots.txt contents are written.
Coroutine.
-
can_fetch_pool
(request: wpull.protocol.http.request.Request)[source]¶ Return whether the request can be fetched based on the pool.
-
fetch_robots_txt
(request: wpull.protocol.http.request.Request, file=None)[source]¶ Fetch the robots.txt file for the request.
Coroutine.
-
robots_txt_pool
¶ Return the RobotsTxtPool.
-
web_client
¶ Return the WebClient.
protocol.http.stream
Module¶
HTTP protocol streamers.
-
wpull.protocol.http.stream.
DEFAULT_NO_CONTENT_CODES
= frozenset(range(100, 200)) | frozenset({204, 304})¶ Status codes where a response body is prohibited.
-
class
wpull.protocol.http.stream.
Stream
(connection, keep_alive=True, ignore_length=False)[source]¶ Bases:
object
HTTP stream reader/writer.
Parameters: - connection (
connection.Connection
) – An established connection. - keep_alive (bool) – If True, use HTTP keep-alive.
- ignore_length (bool) – If True, Content-Length headers will be ignored. When using this option, keep_alive should be False.
-
connection
¶ The underlying connection.
-
connection
-
data_event_dispatcher
¶
-
classmethod
get_read_strategy
(response)[source]¶ Return the appropriate strategy for reading the response.
Returns: chunked
,length
,close
.Return type: str
-
read_body
(request, response, file=None, raw=False)[source]¶ Read the response’s content body.
Coroutine.
- connection (
-
wpull.protocol.http.stream.
is_no_body
(request, response, no_content_codes=DEFAULT_NO_CONTENT_CODES)[source]¶ Return whether a content body is not expected.
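The rule reduces to a short predicate. A sketch (ignoring the configurable code set) of the check `is_no_body()` performs:

```python
def is_no_body(method, status_code):
    # HEAD responses and 1xx/204/304 statuses never carry a body.
    return (method == 'HEAD'
            or 100 <= status_code < 200
            or status_code in (204, 304))
```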
protocol.http.util
Module¶
Miscellaneous HTTP functions.
protocol.http.web
Module¶
Advanced HTTP Client handling.
-
class
wpull.protocol.http.web.
LoopType
[source]¶ Bases:
enum.Enum
Indicates the type of request and response.
-
authentication
= None¶ Response to a HTTP authentication.
-
normal
= None¶ Normal response.
-
redirect
= None¶ Redirect.
-
robots
= None¶ Response to a robots.txt request.
-
-
class
wpull.protocol.http.web.
WebClient
(http_client: typing.Union=None, request_factory: typing.Callable=<class 'wpull.protocol.http.request.Request'>, redirect_tracker_factory: typing.Union=<class 'wpull.protocol.http.redirect.RedirectTracker'>, cookie_jar: typing.Union=None)[source]¶ Bases:
object
A web client handles redirects, cookies, basic authentication.
Parameters: - http_client – An HTTP client.
- request_factory – A function that returns a new
http.request.Request
- redirect_tracker_factory – A function that returns a new
http.redirect.RedirectTracker
- cookie_jar – A cookie jar.
Return the Cookie Jar.
-
http_client
¶ Return the HTTP Client.
-
redirect_tracker_factory
¶ Return the Redirect Tracker factory.
-
request_factory
¶ Return the Request factory.
-
session
(request: wpull.protocol.http.request.Request) → wpull.protocol.http.web.WebSession[source]¶ Return a fetch session.
Parameters: request – The request to be fetched. Example usage:
client = WebClient()
session = client.session(Request('http://www.example.com'))

with session:
    while not session.done():
        request = session.next_request()
        print(request)
        response = yield from session.start()
        print(response)

        if session.done():
            with open('myfile.html', 'wb') as file:
                yield from session.download(file)
        else:
            yield from session.download()
Returns: WebSession
-
class
wpull.protocol.http.web.
WebSession
(request: wpull.protocol.http.request.Request, http_client: wpull.protocol.http.client.Client, redirect_tracker: wpull.protocol.http.redirect.RedirectTracker, request_factory: typing.Callable, cookie_jar: typing.Union=None)[source]¶ Bases:
object
A web session.
-
done
() → bool[source]¶ Return whether the session has finished.
Returns: If True, the document has been fully fetched. Return type: bool
-
download
(file: typing.Union=None, duration_timeout: typing.Union=None)[source]¶ Download content.
Parameters: - file – An optional file object for the document contents.
- duration_timeout – Maximum time in seconds within which the entire file must be read.
Returns: An instance of
http.request.Response
.Return type: http.request.Response. See
WebClient.session()
for proper usage of this function.Coroutine.
-
loop_type
() → wpull.protocol.http.web.LoopType[source]¶ Return the type of response.
Seealso: LoopType
.
-
redirect_tracker
¶ Return the Redirect Tracker.
-
proxy
Module¶
proxy.client
Module¶
Proxy support for HTTP requests.
-
class
wpull.proxy.client.
HTTPProxyConnectionPool
(proxy_address, *args, proxy_ssl=False, authentication=None, ssl_context=True, host_filter=None, **kwargs)[source]¶ Bases:
wpull.network.pool.ConnectionPool
Establish pooled connections to a HTTP proxy.
Parameters: - proxy_address (tuple) – Tuple containing host and port of the proxy server.
- connection_pool (
connection.ConnectionPool
) – Connection pool - proxy_ssl (bool) – Whether to connect to the proxy using HTTPS.
- authentication (tuple) – Tuple containing username and password.
- ssl_context – SSL context for SSL connections on TCP tunnels.
- host_filter (
proxy.hostfilter.HostFilter
) – Host filter for deciding whether a connection is routed through the proxy. Hosts for which the filter test returns True are routed through the proxy.
proxy.hostfilter
Module¶
Host filtering.
proxy.server
Module¶
Proxy Tools
-
class
wpull.proxy.server.
HTTPProxyServer
(http_client: wpull.protocol.http.client.Client)[source]¶ Bases:
wpull.application.hook.HookableMixin
HTTP proxy server for use with man-in-the-middle recording.
An instance of this class is meant to be used as a callback:
asyncio.start_server(HTTPProxyServer(HTTPClient))
Parameters: http_client ( http.client.Client
) – The HTTP client.-
request_callback
¶ A callback function that accepts a Request.
-
pre_response_callback
¶ A callback function that accepts a Request and Response
-
response_callback
¶ A callback function that accepts a Request and Response
-
regexstream
Module¶
Regular expression streams.
-
class
wpull.regexstream.
RegexStream
(file, pattern, read_size=16384, overlap_size=4096)[source]¶ Bases:
object
Streams file with regular expressions.
Parameters: - file – File object.
- pattern – A compiled regular expression object.
- read_size (int) – The size of a chunk of text that is searched.
- overlap_size (int) – The amount of overlap between chunks of text that is searched.
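The overlap exists so that a match spanning two read chunks is still found. A simplified sketch of the technique (a hypothetical helper; RegexStream itself yields richer per-match detail):

```python
import io
import re

def scan_stream(file, pattern, read_size=16, overlap_size=4):
    # Keep the tail of the previous chunk so matches that straddle a
    # chunk boundary are still seen; dedupe by absolute offset.
    offset = 0   # absolute position of buffer[0] within the stream
    buffer = ''
    found = set()
    while True:
        chunk = file.read(read_size)
        keep = buffer[-overlap_size:] if chunk else ''
        offset += len(buffer) - len(keep)
        buffer = keep + chunk
        for match in pattern.finditer(buffer):
            found.add((offset + match.start(), match.group()))
        if not chunk:
            return sorted(found)
```

With `overlap_size` at least one less than the longest possible match, a token split across reads is still recovered.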
resmon
Module¶
Resource monitor.
-
wpull.resmon.
ResourceInfo
¶ Resource level information
-
wpull.resmon.
path
¶ str, None
File path of the resource.
None
is provided for memory usage.
-
wpull.resmon.
free
¶ int
Number of bytes available.
-
wpull.resmon.
limit
¶ int
Minimum bytes of the resource.
alias of
ResourceInfoType
-
-
class
wpull.resmon.
ResourceMonitor
(resource_paths=('/', ), min_disk=10000, min_memory=10000)[source]¶ Bases:
object
Monitor available resources such as disk space and memory.
Parameters: - resource_paths (list) – List of paths to monitor. Recommended paths include temporary directories and the current working directory.
- min_disk (int, optional) – Minimum disk space in bytes.
- min_memory (int, optional) – Minimum memory in bytes.
robotstxt
Module¶
Robots.txt exclusion directives.
scraper
Module¶
Document scrapers.
scraper.base
Module¶
Base classes
-
class
wpull.scraper.base.
BaseExtractiveScraper
[source]¶ Bases:
wpull.scraper.base.BaseScraper
,wpull.document.base.BaseExtractiveReader
-
class
wpull.scraper.base.
BaseHTMLScraper
[source]¶ Bases:
wpull.scraper.base.BaseScraper
,wpull.document.base.BaseHTMLReader
-
class
wpull.scraper.base.
BaseScraper
[source]¶ Bases:
object
Base class for scrapers.
-
scrape
(request, response, link_type=None)[source]¶ Extract the URLs from the document.
Parameters: - request (
http.request.Request
) – The request. - response (
http.request.Response
) – The response. - link_type – A value from
item.LinkType
.
Returns: LinkContexts and document information.
If None, then the scraper does not support scraping the document.
Return type: ScrapeResult, None
- request (
-
-
class
wpull.scraper.base.
BaseTextStreamScraper
[source]¶ Bases:
wpull.scraper.base.BaseScraper
,wpull.document.base.BaseTextStreamReader
Base class for scrapers that process both link and non-link text.
-
iter_processed_links
(file, encoding=None, base_url=None, context=False)[source]¶ Return the links.
This function is a convenience function for calling
iter_processed_text()
and returning only the links.
-
iter_processed_text
(file, encoding=None, base_url=None)[source]¶ Return the file text and processed absolute links.
Parameters: - file – A file object containing the document.
- encoding (str) – The encoding of the document.
- base_url (str) – The URL at which the document is located.
Returns: Each item is a tuple:
- str: The text
- bool: Whether the text is a link
Return type: iterator
-
-
class
wpull.scraper.base.
DemuxDocumentScraper
(document_scrapers)[source]¶ Bases:
wpull.scraper.base.BaseScraper
Puts multiple Document Scrapers into one.
-
scrape
(request, response, link_type=None)[source]¶ Iterate the scrapers, returning the first of the results.
-
scrape_info
(request, response, link_type=None)[source]¶ Iterate the scrapers and return a dict of results.
Returns: A dict where the keys are the scrapers instances and the values are the results. That is, a mapping from BaseDocumentScraper
toScrapeResult
.Return type: dict
-
-
wpull.scraper.base.
LinkContext
¶ A named tuple describing a scraped link.
-
wpull.scraper.base.
link
¶ str
The link that was scraped.
-
wpull.scraper.base.
inline
¶ bool
Whether the link is an embedded object.
-
wpull.scraper.base.
linked
¶ bool
Whether the link links to another page.
-
wpull.scraper.base.
link_type
¶ A value from
item.LinkType
.
-
wpull.scraper.base.
extra
¶ Any extra info.
alias of
LinkContextType
-
-
class
wpull.scraper.base.
ScrapeResult
(link_contexts, encoding)[source]¶ Bases:
dict
Links scraped from a document.
This class is subclassed from
dict
and contains convenience methods.-
encoding
¶ Character encoding of the document.
-
inline
¶ Link Context of objects embedded in the document.
-
inline_links
¶ URLs of objects embedded in the document.
-
link_contexts
¶ Link Contexts.
-
linked
¶ Link Context of objects linked from the document
-
linked_links
¶ URLs of objects linked from the document
-
scraper.css
Module¶
Stylesheet scraper.
-
class
wpull.scraper.css.
CSSScraper
(encoding_override=None)[source]¶ Bases:
wpull.document.css.CSSReader
,wpull.scraper.base.BaseTextStreamScraper
Scrapes CSS stylesheet documents.
scraper.html
Module¶
HTML link extractor.
-
class
wpull.scraper.html.
ElementWalker
(css_scraper=None, javascript_scraper=None)[source]¶ Bases:
object
-
ATTR_HTML
= 2¶ Flag for links that point to other documents.
-
ATTR_INLINE
= 1¶ Flag for embedded objects (like images, stylesheets) in documents.
-
DYNAMIC_ATTRIBUTES
= ('onkey', 'oncli', 'onmou')¶ Attributes that contain JavaScript.
-
LINK_ATTRIBUTES
= frozenset({'usemap', 'data', 'href', 'profile', 'action', 'dynsrc', 'classid', 'codebase', 'cite', 'longdesc', 'lowsrc', 'archive', 'background', 'src'})¶ HTML element attributes that may contain links.
-
OPEN_GRAPH_LINK_NAMES
= ('og:url', 'twitter:player')¶ Iterate elements looking for links.
Parameters: - css_scraper (
scraper.css.CSSScraper
) – Optional CSS scraper. - ( (javascript_scraper) – class:`.scraper.javascript.JavaScriptScraper): Optional JavaScript scraper.
- css_scraper (
-
OPEN_GRAPH_MEDIA_NAMES
= ('og:image', 'og:audio', 'og:video', 'twitter:image:src', 'twitter:image0', 'twitter:image1', 'twitter:image2', 'twitter:image3', 'twitter:player:stream')¶
-
TAG_ATTRIBUTES
= {'bgsound': {'src': 1}, 'body': {'background': 1}, 'input': {'src': 1}, 'area': {'href': 2}, 'iframe': {'src': 3}, 'applet': {'code': 1}, 'script': {'src': 1}, 'embed': {'href': 2, 'src': 3}, 'overlay': {'src': 3}, 'a': {'href': 2}, 'object': {'data': 1}, 'form': {'action': 2}, 'table': {'background': 1}, 'th': {'background': 1}, 'td': {'background': 1}, 'layer': {'src': 3}, 'fig': {'src': 1}, 'frame': {'src': 3}, 'img': {'href': 1, 'lowsrc': 1, 'src': 1}}¶ Mapping of element tag names to attributes containing links.
-
classmethod
is_html_link
(tag, attribute)[source]¶ Return whether the link is likely to be an external object.
-
classmethod
is_link_inline
(tag, attribute)[source]¶ Return whether the link is likely to be an inline object.
-
iter_links
(elements)[source]¶ Iterate the document root for links.
Returns: An iterator of LinkedInfo
.Return type: iterable
-
iter_links_by_js_attrib
(attrib_name, attrib_value)[source]¶ Iterate links of a JavaScript pseudo-link attribute.
-
iter_links_link_element
(element)[source]¶ Iterate a
link
for URLs.This function handles stylesheets and icons in addition to standard scraping rules.
-
classmethod
iter_links_meta_element
(element)[source]¶ Iterate the
meta
element for links.This function handles refresh URLs.
-
-
class
wpull.scraper.html.
HTMLScraper
(html_parser, element_walker, followed_tags=None, ignored_tags=None, robots=False, only_relative=False, encoding_override=None)[source]¶ Bases:
wpull.document.html.HTMLReader
,wpull.scraper.base.BaseHTMLScraper
Scraper for HTML documents.
Parameters: - html_parser (document.htmlparse.base.BaseParser) – An HTML parser such as the lxml or html5lib one.
- element_walker (ElementWalker) – HTML element walker.
- followed_tags – A list of tags that should be scraped
- ignored_tags – A list of tags that should not be scraped
- robots – If True, discard any links if they cannot be followed
- only_relative – If True, discard any links that are not absolute paths
-
wpull.scraper.html.
LinkInfo
¶ Information about a link in an lxml document.
-
wpull.scraper.html.
element
¶ An instance of
document.HTMLReadElement
.
-
wpull.scraper.html.
tag
¶ str
The element tag name.
-
wpull.scraper.html.
attrib
¶ str, None
If
str
, the name of the attribute. Otherwise, the link was found inelement.text
.
-
wpull.scraper.html.
link
¶ str
The link found.
-
wpull.scraper.html.
inline
¶ bool
Whether the link is an embedded object (like images or stylesheets).
-
wpull.scraper.html.
linked
¶ bool
Whether the link is a link to another page.
-
wpull.scraper.html.
base_link
¶ str, None
The base URL.
-
wpull.scraper.html.
value_type
¶ str
Indicates how the link was found. Possible values are
plain
: The link was found plainly in an attribute value.list
: The link was found in a space separated list.css
: The link was found in a CSS text.refresh
: The link was found in a refresh meta string.script
: The link was found in JavaScript text.srcset
: The link was found in asrcset
attribute.
-
wpull.scraper.html.
link_type
¶ A value from
item.LinkType
.
alias of
LinkInfoType
-
scraper.javascript
Module¶
Javascript scraper.
-
class
wpull.scraper.javascript.
JavaScriptScraper
(encoding_override=None)[source]¶ Bases:
wpull.document.javascript.JavaScriptReader
,wpull.scraper.base.BaseTextStreamScraper
Scrapes JavaScript documents.
scraper.sitemap
Module¶
Sitemap scraper
-
class
wpull.scraper.sitemap.
SitemapScraper
(html_parser, encoding_override=None)[source]¶ Bases:
wpull.document.sitemap.SitemapReader
,wpull.scraper.base.BaseExtractiveScraper
Scrape Sitemaps
scraper.util
Module¶
Misc functions.
-
wpull.scraper.util.
clean_link_soup
(link)[source]¶ Strip whitespace from a link in HTML soup.
Parameters: link (str) – A string containing the link with lots of whitespace. The link is split into lines. For each line, leading and trailing whitespace is removed and tabs are removed throughout. The lines are concatenated and returned.
For example, passing the
href
value of:<a href=" http://example.com/ blog/entry/ how smaug stole all the bitcoins.html ">
will return
http://example.com/blog/entry/how smaug stole all the bitcoins.html
.Returns: The cleaned link. Return type: str
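The described algorithm is small enough to restate as a sketch (a reimplementation for illustration, not the module's code):

```python
def clean_link_soup(link):
    # Split into lines, strip each line, drop tabs, then concatenate.
    return ''.join(
        line.strip().replace('\t', '') for line in link.splitlines()
    )
```

Applied to the whitespace-laden `href` value shown above, it produces the documented result.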
-
wpull.scraper.util.
identify_link_type
(filename)[source]¶ Return link type guessed by filename extension.
Returns: A value from item.LinkType
.Return type: str
-
wpull.scraper.util.
is_likely_link
(text)[source]¶ Return whether the text is likely to be a link.
This function assumes that leading/trailing whitespace has already been removed.
Returns: bool
stats
Module¶
Statistics.
-
class
wpull.stats.
Statistics
(url_table: typing.Union=None)[source]¶ Bases:
object
Statistics.
-
start_time
¶ float
Timestamp when the engine started.
-
stop_time
¶ float
Timestamp when the engine stopped.
-
files
¶ int
Number of files downloaded.
-
size
¶ int
Size of files in bytes.
-
errors
¶ A Counter mapping error types to integers.
-
quota
¶ int
Number of bytes at which the download quota is exceeded.
-
bandwidth_meter
¶ network.BandwidthMeter
The bandwidth meter.
-
duration
¶ Return the length of the interval in seconds.
-
increment
(size: int)[source]¶ Increment the number of files downloaded.
Parameters: size – The size of the file
-
is_quota_exceeded
¶ Return whether the quota is exceeded.
-
string
Module¶
String and binary data functions.
-
wpull.string.
coerce_str_to_ascii
(string)[source]¶ Force the contents of the string to be ASCII.
Anything not ASCII will be replaced with a replacement character.
Deprecated since version 0.1002: Use
printable_str()
instead.
-
wpull.string.
detect_encoding
(data, encoding=None, fallback='latin1', is_html=False)[source]¶ Detect the character encoding of the data.
Returns: The name of the codec
Return type: str
Raises: ValueError
– The codec could not be detected. This error can only occur if fallback is not a “lossless” codec.
-
wpull.string.
format_size
(num, format_str='{num:.1f} {unit}')[source]¶ Format the file size into a human readable text.
-
wpull.string.
normalize_codec_name
(name)[source]¶ Return the Python name of the encoder/decoder
Returns: str, None
-
wpull.string.
printable_bytes
(data)[source]¶ Remove any bytes that are not printable ASCII.
This function is intended for sniffing content types such as UTF-16 encoded text.
-
wpull.string.
printable_str
(text, keep_newlines=False)[source]¶ Escape any control or non-ASCII characters from string.
This function is intended for use with strings from an untrusted source such as writing to a console or writing to logs. It is designed to prevent things like ANSI escape sequences from showing.
Use
repr()
orascii()
instead for things such as Exception messages.
url
Module¶
URL parsing based on WHATWG URL living standard.
-
wpull.url.
C0_CONTROL_SET
= frozenset({'\x02', '\x17', '\x1c', '\x01', '\x1f', '\x1a', '\x04', '\x15', '\x1b', '\x1e', '\x1d', '\x07', '\r', '\x05', '\x10', '\x0e', '\x16', '\x18', '\x00', '\n', '\x14', '\x06', '\x12', '\x0c', '\x13', '\x19', '\x08', '\x03', '\x0f', '\t', '\x0b', '\x11'})¶ Characters from 0x00 to 0x1f inclusive
-
wpull.url.
DEFAULT_ENCODE_SET
= frozenset({32, 96, 34, 35, 60, 62, 63})¶ Percent encoding set as defined by WHATWG URL living standard.
Does not include U+0000 to U+001F nor U+007F and above.
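A sketch of how such an encode set is applied (ASCII-only; real IRI handling UTF-8-encodes non-ASCII text byte-wise first, so treat this as illustrative):

```python
DEFAULT_ENCODE_SET = frozenset({32, 96, 34, 35, 60, 62, 63})

def percent_encode(text, encode_set=DEFAULT_ENCODE_SET):
    # Code points in the set, controls, and >= 0x7f become %XX escapes.
    result = []
    for char in text:
        point = ord(char)
        if point in encode_set or point <= 0x1f or point >= 0x7f:
            result.append('%{:02X}'.format(point))
        else:
            result.append(char)
    return ''.join(result)

print(percent_encode('a b<c>?'))  # a%20b%3Cc%3E%3F
```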
-
wpull.url.
FORBIDDEN_HOSTNAME_CHARS
= frozenset({'\\', ':', '#', ' ', '%', ']', '?', '@', '/', '['})¶ Forbidden hostname characters.
Does not include non-printing characters. Meant for ASCII.
-
wpull.url.
FRAGMENT_ENCODE_SET
= frozenset({32, 96, 34, 60, 62})¶ Encoding set for fragment.
-
wpull.url.
PASSWORD_ENCODE_SET
= frozenset({32, 96, 34, 35, 64, 47, 60, 92, 62, 63})¶ Encoding set for passwords.
-
class
wpull.url.
PercentEncoderMap
(encode_set)[source]¶ Bases:
collections.defaultdict
Helper map for percent encoding.
-
wpull.url.
QUERY_ENCODE_SET
= frozenset({96, 34, 35, 60, 62})¶ Encoding set for query strings.
This set does not include U+0020 (space) so it can be replaced with U+002B (plus sign) later.
-
wpull.url.
QUERY_VALUE_ENCODE_SET
= frozenset({96, 34, 35, 37, 38, 43, 60, 62})¶ Encoding set for a query value.
-
class
wpull.url.
URLInfo
[source]¶ Bases:
object
Represent parts of a URL.
-
raw
¶ str
Original string.
-
scheme
¶ str
Protocol (for example, HTTP, FTP).
-
authority
¶ str
Raw userinfo and host.
-
path
¶ str
Location of resource. This value always begins with a slash (
/
).
-
query
¶ str
Additional request parameters.
-
fragment
¶ str
Named anchor of a document.
-
userinfo
¶ str
Raw username and password.
-
username
¶ str
Username.
-
password
¶ str
Password.
-
host
¶ str
Raw hostname and port.
-
hostname
¶ str
Hostname or IP address.
-
port
¶ int
IP address port number.
-
resource
¶ str
Raw path, query, and fragment. This value always begins with a slash (
/
).
-
query_map
¶ dict
Mapping of the query. Values are lists.
-
url
¶ str
A normalized URL without userinfo and fragment.
-
encoding
¶ str
Codec name for IRI support.
If scheme is not something like HTTP or FTP, the remaining attributes are None.
All attributes are read only.
For more information about how the URL parts are derived, see https://medialize.github.io/URI.js/about-uris.html
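For comparison, the standard library’s urllib.parse exposes similar URL parts; URLInfo additionally normalizes the URL and applies scheme defaults:

```python
from urllib.parse import urlsplit

# Standard-library counterpart of several URLInfo attributes.
parts = urlsplit('http://user:secret@example.com:8080/path?a=1#frag')
print(parts.scheme, parts.hostname, parts.port)    # http example.com 8080
print(parts.path, parts.query, parts.fragment)     # /path a=1 frag
```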
-
authority
-
encoding
-
fragment
-
host
-
hostname
-
hostname_with_port
¶ Return the host portion but omit default port if needed.
-
classmethod
parse
(url, default_scheme='http', encoding='utf-8')[source]¶ Parse a URL and return a URLInfo.
Parse the authority part and return userinfo and host.
-
password
-
path
-
port
-
query
-
query_map
-
raw
-
resource
-
scheme
-
split_path
()[source]¶ Return the directory and filename from the path.
The results are not percent-decoded.
-
url
-
userinfo
-
username
-
-
wpull.url.
USERNAME_ENCODE_SET
= frozenset({32, 96, 34, 35, 64, 47, 58, 60, 92, 62, 63})¶ Encoding set for usernames.
-
wpull.url.
flatten_path
(path, flatten_slashes=False)[source]¶ Flatten an absolute URL path by removing the dot segments.
urllib.parse.urljoin()
has some support for removing dot segments, but it is conservative and only removes them as needed.
Parameters: - path (str) – The URL path.
- flatten_slashes (bool) – If True, consecutive slashes are removed.
The path returned will always have a leading slash.
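A minimal sketch of dot-segment removal matching the documented behavior (edge cases such as trailing dot segments may differ from the real implementation):

```python
import re

def flatten_path(path, flatten_slashes=False):
    """Remove dot segments from an absolute URL path (illustrative sketch)."""
    if flatten_slashes:
        path = re.sub(r'/+', '/', path)   # collapse consecutive slashes
    parts = []
    for part in path.split('/'):
        if part == '..':
            if parts:
                parts.pop()               # step out of the previous segment
        elif part != '.':
            parts.append(part)
    result = '/'.join(parts)
    if not result.startswith('/'):
        result = '/' + result             # documented: always a leading slash
    return result
```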
-
wpull.url.
is_subdir
(base_path, test_path, trailing_slash=False, wildcards=False)[source]¶ Return whether a path is a subpath of another.
Parameters: - base_path – The base path
- test_path – The path which we are testing
- trailing_slash – If True, the trailing slash is treated with importance.
For example,
/images/
is a directory while/images
is a file. - wildcards – If True, globbing wildcards are matched against paths
-
wpull.url.
normalize
(url, **kwargs)[source]¶ Normalize a URL.
This function is a convenience function that is equivalent to:
>>> URLInfo.parse('http://example.com').url
'http://example.com'
Seealso: URLInfo.parse()
.
-
wpull.url.
normalize_fragment
(text, encoding='utf-8')[source]¶ Normalize a fragment.
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.
normalize_hostname
(hostname)[source]¶ Normalizes a hostname so that it is ASCII and a valid domain name.
-
wpull.url.
normalize_password
(text, encoding='utf-8')[source]¶ Normalize a password
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.
normalize_path
(path, encoding='utf-8')[source]¶ Normalize a path string.
Flattens a path by removing dot parts, percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.
normalize_query
(text, encoding='utf-8')[source]¶ Normalize a query string.
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.
normalize_username
(text, encoding='utf-8')[source]¶ Normalize a username
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.
parse_url_or_log
(url, encoding='utf-8')[source]¶ Parse and return a URLInfo.
This function logs a warning if the URL cannot be parsed and returns None.
-
wpull.url.
percent_encode
(text, encode_set=frozenset({32, 96, 34, 35, 60, 62, 63}), encoding='utf-8')[source]¶ Percent encode text.
Unlike Python’s
quote
, this function accepts a blacklist instead of a whitelist of safe characters.
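A sketch of blacklist-style percent encoding as described, using the documented default encode set; the treatment of control and non-ASCII bytes here is an assumption:

```python
def percent_encode(text, encode_set=frozenset({32, 96, 34, 35, 60, 62, 63}),
                   encoding='utf-8'):
    """Percent-encode text using a blacklist (illustrative sketch)."""
    result = []
    for byte in text.encode(encoding):
        if byte in encode_set or byte < 0x20 or byte > 0x7e:
            result.append('%{:02X}'.format(byte))  # uppercase hex escape
        else:
            result.append(chr(byte))
    return ''.join(result)
```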
-
wpull.url.
percent_encode_plus
(text, encode_set=frozenset({96, 34, 35, 60, 62}), encoding='utf-8')[source]¶ Percent encode text for query strings.
Unlike Python’s
quote_plus
, this function accepts a blacklist instead of a whitelist of safe characters.
-
wpull.url.
query_to_map
(text)[source]¶ Return a key-values mapping from a query string.
Plus symbols are replaced with spaces.
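The documented behavior can be sketched as follows (this sketch does not percent-decode, which the real implementation may do):

```python
def query_to_map(text):
    """Return a key-to-list-of-values mapping from a query string (sketch)."""
    result = {}
    for pair in text.split('&'):
        name, _, value = pair.partition('=')
        # Documented behavior: plus symbols become spaces; values are lists.
        result.setdefault(name, []).append(value.replace('+', ' '))
    return result
```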
-
wpull.url.
schemes_similar
(scheme1, scheme2)[source]¶ Return whether URL schemes are similar.
This function considers the following schemes to be similar:
- HTTP and HTTPS
-
wpull.url.
split_query
(qs, keep_blank_values=False)[source]¶ Split the query string.
Note for empty values: If an equal sign (
=
) is present, the value will be an empty string (''
). Otherwise, the value will be None:
>>> list(split_query('a=&b', keep_blank_values=True))
[('a', ''), ('b', None)]
No processing is done on the actual values.
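The doctest above can be reproduced with a short sketch; the handling of keep_blank_values=False here is an assumption:

```python
def split_query(qs, keep_blank_values=False):
    """Split a query string without decoding the values (sketch)."""
    for pair in qs.split('&'):
        name, sep, value = pair.partition('=')
        if not sep:
            value = None       # no '=' present at all
        if value or keep_blank_values:
            yield name, value
```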
urlfilter
Module¶
URL filters.
-
class
wpull.urlfilter.
BackwardDomainFilter
(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Return whether the hostname matches a list of hostname suffixes.
-
class
wpull.urlfilter.
BackwardFilenameFilter
(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Filter URLs that match the filename suffixes.
-
class
wpull.urlfilter.
BaseURLFilter
[source]¶ Bases:
object
Base class for URL filters.
The Processor uses filters to determine whether a URL should be downloaded.
-
class
wpull.urlfilter.
DemuxURLFilter
(url_filters: typing.Iterator)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Combines multiple URL filters into one.
-
test_info
(url_info, url_table_record) → dict[source]¶ Returns info about which filters passed or failed.
Returns: A dict containing the keys:
verdict (bool): Whether all the tests passed.
passed (set): A set of URLFilters that passed.
failed (set): A set of URLFilters that failed.
map (dict): A mapping from URLFilter class name (str) to the verdict (bool).
Return type: dict
-
url_filters
¶
-
-
class
wpull.urlfilter.
DirectoryFilter
(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Filter URLs that match a directory path part.
-
class
wpull.urlfilter.
FollowFTPFilter
(follow=False)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Follow links to FTP URLs.
-
class
wpull.urlfilter.
HTTPSOnlyFilter
[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Allow URL if the URL is HTTPS.
-
class
wpull.urlfilter.
HostnameFilter
(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Return whether the hostname matches exactly in a list.
-
class
wpull.urlfilter.
LevelFilter
(max_depth, inline_max_depth=5)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Allow URLs up to a level of recursion.
-
class
wpull.urlfilter.
ParentFilter
[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Filter URLs that ascend to parent paths.
-
class
wpull.urlfilter.
RecursiveFilter
(enabled=False, page_requisites=False)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Return
True
if recursion is used.
-
class
wpull.urlfilter.
RegexFilter
(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Filter URLs that match a regular expression.
-
class
wpull.urlfilter.
SchemeFilter
(allowed=('http', 'https', 'ftp'))[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Allow the URL if its scheme is in the allowed list.
-
class
wpull.urlfilter.
SpanHostsFilter
(hostnames, enabled=False, page_requisites=False, linked_pages=False)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Filter URLs that go to other hostnames.
-
class
wpull.urlfilter.
TriesFilter
(max_tries)[source]¶ Bases:
wpull.urlfilter.BaseURLFilter
Allow URLs that have been attempted up to a limit of tries.
urlrewrite
Module¶
URL rewriting.
util
Module¶
Miscellaneous functions.
-
class
wpull.util.
ASCIIStreamWriter
(stream, errors='backslashreplace')[source]¶ Bases:
codecs.StreamWriter
A Stream Writer that encodes everything to ASCII.
By default, the replacement character is a Python backslash sequence.
-
DEFAULT_ERROR
= 'backslashreplace'¶
-
-
class
wpull.util.
GzipPickleStream
(filename=None, file=None, mode='rb', **kwargs)[source]¶ Bases:
wpull.util.PickleStream
gzip compressed pickle stream.
-
class
wpull.util.
PickleStream
(filename=None, file=None, mode='rb', protocol=3)[source]¶ Bases:
object
Pickle stream helper.
-
wpull.util.
filter_pem
(data)[source]¶ Processes the bytes for PEM certificates.
Returns: set
containing each certificate
-
wpull.util.
get_exception_message
(instance)[source]¶ Try to get the exception message or the class name.
-
wpull.util.
get_package_data
(filename, mode='rb')[source]¶ Return the contents of a real file or a zip file.
-
wpull.util.
get_package_filename
(filename, package_dir=None)[source]¶ Return the filename of the data file.
-
wpull.util.
grouper
(iterable, n, fillvalue=None)[source]¶ Collect data into fixed-length chunks or blocks
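The docstring matches the classic itertools “grouper” recipe, which can be written as:

```python
import itertools

def grouper(iterable, n, fillvalue=None):
    """Collect data into fixed-length chunks or blocks (itertools recipe)."""
    args = [iter(iterable)] * n        # n references to the same iterator
    return itertools.zip_longest(*args, fillvalue=fillvalue)
```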
-
wpull.util.
parse_iso8601_str
(string)[source]¶ Parse a fixed ISO8601 datetime string.
Note
This function only parses dates in the format
%Y-%m-%dT%H:%M:%SZ
. You must use a library like dateutil
to properly parse dates and times.
Returns: A UNIX timestamp. Return type: float
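The fixed-format parse described above can be sketched with the standard library:

```python
import calendar
import time

def parse_iso8601_str(string):
    """Parse a fixed %Y-%m-%dT%H:%M:%SZ string to a UNIX timestamp (sketch)."""
    struct = time.strptime(string, '%Y-%m-%dT%H:%M:%SZ')
    return float(calendar.timegm(struct))   # timegm treats the struct as UTC
```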
version
Module¶
Version information.
-
wpull.version.
__version__
¶ A string conforming to Semantic Versioning Guidelines
-
wpull.version.
version_info
¶ A tuple in the same format of
sys.version_info
waiter
Module¶
Delays between requests.
-
class
wpull.waiter.
LinearWaiter
(wait=0.0, random_wait=False, max_wait=10.0)[source]¶ Bases:
wpull.waiter.Waiter
A linear back-off waiter.
Parameters: - wait – The normal delay time
- random_wait – If True, randomly perturb the delay time by a factor between 0.5 and 1.5
- max_wait – The maximum delay time
This waiter will increment by values of 1 second.
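The behavior can be sketched as follows; the method names (get, increment, reset) are assumptions about the Waiter interface, not taken from the source:

```python
import random

class LinearWaiter:
    """Linear back-off between requests (illustrative sketch)."""

    def __init__(self, wait=0.0, random_wait=False, max_wait=10.0):
        self._base = wait
        self._current = wait
        self._random_wait = random_wait
        self._max_wait = max_wait

    def get(self):
        """Return the current delay, optionally perturbed by 0.5-1.5."""
        if self._random_wait:
            return self._current * random.uniform(0.5, 1.5)
        return self._current

    def increment(self):
        """Increase the delay by 1 second, capped at max_wait."""
        self._current = min(self._current + 1.0, self._max_wait)

    def reset(self):
        """Restore the normal delay time."""
        self._current = self._base
```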
warc
Module¶
warc.format
Module¶
WARC format.
For the WARC file specification, see http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.
For the CDX specifications, see https://archive.org/web/researcher/cdx_file_format.php and https://github.com/internetarchive/CDX-Writer.
-
class
wpull.warc.format.
WARCRecord
[source]¶ Bases:
object
A record in a WARC file.
-
fields
¶ An instance of
namevalue.NameValueRecord
.
-
block_file
¶ A file object. May be None.
-
CONTENT_TYPE
= 'Content-Type'¶
-
NAME_OVERRIDES
= frozenset({'Content-Length', 'WARC-Date', 'Content-Type', 'WARC-Warcinfo-ID', 'WARC-Segment-Origin-ID', 'WARC-Segment-Number', 'WARC-Block-Digest', 'WARC-Identified-Payload-Type', 'WARC-Refers-To', 'WARC-Target-URI', 'WARC-Type', 'WARC-Profile', 'WARC-Segment-Total-Length', 'WARC-Payload-Digest', 'WARC-Truncated', 'WARC-Record-ID', 'WARC-Concurrent-To', 'WARC-IP-Address', 'WARC-Filename'})¶ Field name case normalization overrides because hanzo’s warc-tools do not adequately conform to specifications.
-
REQUEST
= 'request'¶
-
RESPONSE
= 'response'¶
-
REVISIT
= 'revisit'¶
-
SAME_PAYLOAD_DIGEST_URI
= 'http://netpreserve.org/warc/1.0/revisit/identical-payload-digest'¶
-
TYPE_REQUEST
= 'application/http;msgtype=request'¶
-
TYPE_RESPONSE
= 'application/http;msgtype=response'¶
-
VERSION
= 'WARC/1.0'¶
-
WARCINFO
= 'warcinfo'¶
-
WARC_DATE
= 'WARC-Date'¶
-
WARC_FIELDS
= 'application/warc-fields'¶
-
WARC_RECORD_ID
= 'WARC-Record-ID'¶
-
WARC_TYPE
= 'WARC-Type'¶
-
compute_checksum
(payload_offset: typing.Optional[int]=None)[source]¶ Compute and add the checksum data to the record fields.
This function also sets the content length.
-
get_http_header
() → wpull.protocol.http.request.Response[source]¶ Return the HTTP header.
It only attempts to read the first 4 KiB of the payload.
Returns: Returns an instance of http.request.Response
or None.Return type: Response, None
-
set_common_fields
(warc_type: str, content_type: str)[source]¶ Set the required fields for the record.
-
warc.recorder
Module¶
-
class
wpull.warc.recorder.
BaseWARCRecorderSession
(recorder, temp_dir=None, url_table=None)[source]¶ Bases:
object
Base WARC recorder session.
-
class
wpull.warc.recorder.
FTPWARCRecorderSession
(*args, **kwargs)[source]¶ Bases:
wpull.warc.recorder.BaseWARCRecorderSession
FTP WARC Recorder Session.
-
class
wpull.warc.recorder.
HTTPWARCRecorderSession
(*args, **kwargs)[source]¶ Bases:
wpull.warc.recorder.BaseWARCRecorderSession
HTTP WARC Recorder Session.
-
class
wpull.warc.recorder.
WARCRecorder
(filename, params=None)[source]¶ Bases:
object
Record to WARC file.
Parameters: - filename (str) – The filename (without the extension).
- params (
WARCRecorderParams
) – Parameters.
-
CDX_DELIMINATOR
= ' '¶ Default CDX delimiter.
-
DEFAULT_SOFTWARE_STRING
= 'Wpull/2.0.1 Python/3.4.3'¶ Default software string.
-
classmethod
parse_mimetype
(value)[source]¶ Return the MIME type from a Content-Type string.
Returns: A string in the form type/subtype
or None.Return type: str, None
-
wpull.warc.recorder.
WARCRecorderParams
¶ WARCRecorder
parameters.
Parameters: - compress (bool) – If True, files will be compressed with gzip
- extra_fields (list) – A list of key-value pairs containing extra metadata fields
- temp_dir (str) – Directory to use for temporary files
- log (bool) – Include the program logging messages in the WARC file
- appending (bool) – If True, the file is not overwritten upon opening
- digests (bool) – If True, the SHA1 hash digests will be written.
- cdx (bool) – If True, a CDX file will be written.
- max_size (int) – If provided, output files are named like
name-00000.ext
and the log file will be inname-meta.ext
. - move_to (str) – If provided, completed WARC files and CDX files will be moved to the given directory
- url_table (
database.URLTable
) – If given, then revisit
records will be written. - software_string (str) – The value for the
software
field in the Warcinfo record.
alias of
WARCRecorderParamsType
writer
Module¶
Document writers.
-
class
wpull.writer.
AntiClobberFileWriter
(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseFileWriter
File writer that downloads to a new filename if the original exists.
-
session_class
¶
-
-
class
wpull.writer.
AntiClobberFileWriterSession
(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]¶
-
class
wpull.writer.
BaseFileWriter
(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseWriter
Base class for saving documents to disk.
Parameters: - path_namer – The path namer.
- file_continuing – If True, the writer will modify requests to fetch the remaining portion of the file
- headers_included – If True, the writer will include the HTTP header responses on top of the document
- local_timestamping – If True, the writer will set the Last-Modified timestamp on downloaded files
- adjust_extension – If True, an HTML or CSS file extension will be added whenever the content is detected as such.
- content_disposition – If True, the filename is extracted from the Content-Disposition header.
- trust_server_names – If True and there is redirection, use the last given response for the filename.
-
session_class
¶ Return the class of File Writer Session.
This should be overridden by subclasses.
-
class
wpull.writer.
BaseFileWriterSession
(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]¶ Bases:
wpull.writer.BaseWriterSession
Base class for File Writer Sessions.
-
classmethod
open_file
(filename: str, response: wpull.protocol.abstract.request.BaseResponse, mode='wb+')[source]¶ Open a file object on to the Response Body.
Parameters: - filename – The path where the file is to be saved
- response – Response
- mode – The file mode
This function will create the directories if they do not exist.
-
-
class
wpull.writer.
BaseWriterSession
[source]¶ Bases:
object
Base class for a single document to be written.
-
discard_document
(response: wpull.protocol.abstract.request.BaseResponse)[source]¶ Don’t save the document.
This function is called by a Processor once the Processor has deemed that the document should be deleted (e.g., a “404 Not Found” response).
-
extra_resource_path
(suffix: str) → typing.Optional[str][source]¶ Return a filename suitable for saving extra resources.
-
process_request
(request: wpull.protocol.abstract.request.BaseRequest) → wpull.protocol.abstract.request.BaseRequest[source]¶ Rewrite the request if needed.
This function is called by a Processor after it has created the Request, but before submitting it to a Client.
Returns: The original Request or a modified Request
-
-
class
wpull.writer.
IgnoreFileWriter
(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseFileWriter
File writer that ignores files that already exist.
-
session_class
¶
-
-
class
wpull.writer.
IgnoreFileWriterSession
(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]¶
-
class
wpull.writer.
MuxBody
(stream: typing.BinaryIO, **kwargs)[source]¶ Bases:
wpull.body.Body
Writes data into a second file.
-
class
wpull.writer.
NullWriter
[source]¶ Bases:
wpull.writer.BaseWriter
File writer that doesn’t write files.
-
class
wpull.writer.
OverwriteFileWriter
(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseFileWriter
File writer that overwrites files.
-
session_class
¶
-
-
class
wpull.writer.
OverwriteFileWriterSession
(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]¶
-
class
wpull.writer.
SingleDocumentWriter
(stream: typing.BinaryIO, headers_included: bool=False)[source]¶ Bases:
wpull.writer.BaseWriter
Writer that writes all the data into a single file.
-
class
wpull.writer.
SingleDocumentWriterSession
(stream: typing.BinaryIO, headers_included: bool)[source]¶ Bases:
wpull.writer.BaseWriterSession
Write all data into stream.
-
class
wpull.writer.
TimestampingFileWriter
(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseFileWriter
File writer that only downloads newer files from the server.
-
session_class
¶
-
What’s New¶
Summary of notable changes.
Unreleased¶
2.0.1 (2016-06-21)¶
- Fixed: KeyError crash when psutil was not installed.
- Fixed: AttributeError proxy error using PhantomJS due to response body not written to a file.
2.0 (2016-06-17)¶
- Removed: Lua scripting support and its Python counterpart (
--lua-script
and--python-script
). - Removed: Python 3.2 & 3.3 support.
- Removed: PyPy support.
- Changed: IP addresses are normalized to a standard notation to avoid fetching duplicates such as IPv4 addresses written in hexadecimal or long-hand IPv6 addresses.
- Changed: Scripting is now done using plugin interface via
--plugin-script
. - Fixed: Support for Python 3.5.
- Fixed: FTP unable to handle directory listing with date in MMM DD YYYY and filename containing YYYY-MM-DD text.
- Fixed: Downloads through the proxy (such as PhantomJS) now show up in the database and can be controlled through scripting.
- Fixed: NotFound error when converting links in CSS file that contain URLs that were not fetched.
- Fixed: When resuming a forcefully interrupted crawl (e.g., a crash) using a database, URLs in progress were not restarted because they were not reset in the database when the program started up.
Backwards incompatibility¶
This release contains backwards incompatible changes to the database schema and scripting interface.
If you use --database
, the database created by older versions of
Wpull cannot be used in this version.
Scripting hook code will need to be rewritten to use the new API. See the new documentation for scripting for the new style of interfacing with Wpull.
Additionally for scripts, the internal event loop has switched from Trollius to built-in Asyncio.
1.2.3 (2016-02-03)¶
- Removed: cx_freeze build support.
- Deprecated: Lua Scripting support will be removed in next release.
- Deprecated: Python 3.2 & 3.3 support will be removed in the next release.
- Deprecated: PyPy support will be removed in the next release.
- Fixed: Error when logging in with FTP to servers that don’t need a password.
- Fixed: ValueError when downloading URLs that contain unencoded unprintable characters like Zero Width Non-Joiner or Right to Left Mark.
1.2.2 (2015-10-21)¶
- Fixed:
--output-document
file doesn’t contain content. - Fixed: OverflowError when URL contains invalid port number greater than 65535 or less than 0.
- Fixed: AssertionError when saving IPv4-mapped IPv6 addresses to WARC files.
- Fixed: AttributeError when running with installed Trollius 2.0.
- Changed: The setup file no longer requires optional psutil.
1.2.1 (2015-05-15)¶
- Fixed: OverflowError with URLs with large port numbers.
- Fixed: TypeError when using standard input as input file (
--input-file -
). - Changed: using
--youtube-dl
respects--inet4-only
and--no-check-certificate
now.
1.2 (2015-04-24)¶
- Fixed: Connecting to sites with IPv4 & IPv6 support resulted in errors when IPv6 was not supported by the local network. Connections now use Happy Eyeballs Algorithm for IPv4 & IPv6 dual-stack support.
- Fixed: SQLAlchemy error with PyPy and SQLAlchemy 1.0.
- Fixed: Input URLs are not fetched in order. Regression since 1.1.
- Fixed: UnicodeEncodeError when fetching FTP files with non-ASCII filenames.
- Fixed: Session cookies not loaded when using
--load-cookies
. - Fixed:
--keep-session-cookies
was always on. - Changed: FTP communication uses UTF-8 instead of Latin-1.
- Changed:
--prefer-family=none
is now default. - Added:
none
as a choice to--prefer-family
. - Added:
--no-glob
and FTP filename glob support.
1.1.1 (2015-04-13)¶
- Changed: when using
--youtube-dl
and--warc-file
, the JSON metadata file is now saved in the WARC file in a format compatible with pywb. - Changed: logging and progress meter to say “unspecified” instead of “none” when no content length is provided by the server, to match Wget.
1.1 (2015-04-03)¶
- Security: Updated certificate bundle.
- Fixed:
--regex-type
to accept pcre
instead of posix
. Regular expressions always use Python’s regex library. Posix regex is not supported. - Fixed: when using
--warc-max-size
and--warc-append
, it wrote to existing sequential WARC files unnecessarily. - Fixed: input URLs stored in memory instead of saved on disk. This issue was notable if there were many URLs provided by the
--input-file
option. - Changed: when using
--warc-max-size
and--warc-append
, the next sequential WARC file is created to avoid appending to corrupt files. - Changed: WARC file writing to use journal files and refuse to start program if any journals exist. This avoids corrupting files through naive use of
--warc-append
and allow for future automated recovery. - Added: Open Graph and Twitter Card element links extraction.
1.0 (2015-03-14)¶
- Fixed: a
--database
path with a question mark (?
) truncated the path, did not use an on-disk database, or caused a TypeError
. The question mark is automatically replaced with an underscore. - Fixed: HTTP proxy support broken since version 0.1001.
- Added:
no_proxy
environment variable support. - Added:
--proxy-domains
,--proxy-exclude-domains
,--proxy-hostnames
,--proxy-exclude-hostnames
- Removed:
--no-secure-proxy-tunnel
.
0.1009 (2015-03-08)¶
- Added:
--preserve-permissions
. - Fixed: exit code returned as 2 instead of 1 on generic errors.
- Changed: exception tracebacks are printed only on generic errors.
- Changed: temporary WARC log file is now compressed to save space.
Scripting Hook API:
- Added: Version 3 API
- Added:
wait_time
to version 3, which provides useful context including response or error info.
0.1008 (2015-02-26)¶
- Security: updated certificate bundle.
- Fixed: TypeError crash on bad Meta Refresh HTML element.
- Fixed: unable to fetch FTP files with spaces and other special characters.
- Fixed: AssertionError fetching URLs with trailing dot not properly removed.
- Added:
--no-cache
. - Added:
--report-speed
. - Added:
--monitor-disk
and--monitor-memory
.
0.1007 (2015-02-19)¶
- Fixed malformed URLs printed to logs without sanitation.
- Fixed AttributeError crash on FTP servers that support MLSD.
- Improved link recursion heuristics when extracting from JavaScript and HTML.
- Added
--retr-symlinks
. - Added
--session-timeout
.
0.1006.1 (2015-02-09)¶
- Security: Fixed
Referer
HTTP header field leaking from HTTPS to HTTP. - Fixed
AttributeError
in proxy when using PhantomJS andpre_response
scripting hook. - Fixed early program end when server returns error fetching robots.txt.
- Fixed uninteresting errors outputted if program is forcefully closed.
- Fixed
--referer
option not applied to subsequent requests.
0.1006 (2015-02-01)¶
- Fixed inability to fetch URLs with hostnames starting/ending with hyphen.
- Fixed “Invalid file descriptor” error in proxy server.
- Fixed FTP listing dates mistakenly parsed as future date within the same month.
- Added
--escaped-fragment
option. - Added
--strip-session-id
option. - Added
--no-skip-getaddrinfo
option. - Added
--limit-rate
option. - Added
--phantomjs-max-time
option. - Added
--youtube-dl
option. - Added
--plugin-script
option. - Improved PhantomJS stability.
0.1005 (2015-01-15)¶
- Security: SSLv2/SSLv3 is disabled for
--secure-protocol=auto
. Added--no-strong-crypto
that re-enables them again if needed. - Fixed NameError with PhantomJS proxy on Python 3.2.
- Fixed PhantomJS stop waiting for page load too early.
- Fixed “Line too long” error and remove uninteresting page errors during PhantomJS.
- Fixed
--page-requisites
exceeding--level
. - Fixed
--no-verbose
not providing informative messages and behaving like--quiet
. - Fixed infinite page requisite recursion when using
--span-hosts-allow page-requisites
. - Added
--page-requisites-level
. The default max recursion depth on page requisites is now 5. - Added
--very-quiet
. --no-verbose
is defaulted when--concurrent
is 2 or greater.
Database Schema:
- URL
inline
column is now an integer.
0.1004.2 (2015-01-03)¶
Hotfix release.
- Fixed PhantomJS mode’s MITM proxy AttributeError on certificates.
0.1004.1 (2015-01-03)¶
- Fixed TypeError crash on a bad cookie.
- Fixed PhantomJS mode’s MITM proxy SSL certificates not installed.
0.1004 (2014-12-25)¶
- Fixed FTP data connection reuse error.
- Fixed maximum recursion depth exceeded on FTP downloads.
- Fixed FTP file listing detecting dates too eagerly as ISO8601 format.
- Fixed crash on FTP if file listing could not find a date in a line.
- Fixed HTTP status code 204 “No Content” interpreted as an error.
- Fixed “cert already in hash table” error when using both OS and Wpull’s certificates.
- Improved PhantomJS stability. Timeout errors should be less frequent.
- Added
--adjust-extension
. - Added
--content-disposition
. - Added
--trust-server-names
.
0.1003 (2014-12-11)¶
- Fixed FTP fetch where code 125 was not recognized as valid.
- Fixed FTP 12 o’clock AM/PM time logic.
- Fixed URLs fetched as lowercase URLs when scheme and authority separator is not provided.
- Added
--database-uri
option to specify a SQLAlchemy URI. - Added
none
as a choice to--progress
. - Added
--user
/--password
support. - Scripting:
- Fixed missing response callback during redirects. Regression introduced in v0.1002.
0.1002 (2014-11-24)¶
- Fixed control characters printed without escaping.
- Fixed cookie size not limited correctly per domain name.
- Fixed URL parsing incorrectly allowing spaces in hostnames.
- Fixed
--sitemaps
option not respecting--no-parent
. - Fixed “Content overrun” error on broken web servers. A warning is logged instead.
- Fixed SSL verification error despite
--no-check-certificate
is specified. - Fixed crash on IPv6 URLs containing consecutive dots.
- Fixed crash attempting to connect to IPv6 addresses.
- Consecutive slashes in URL paths are now flattened.
- Fixed crash when fetching IPv6 robots.txt file.
- Added experimental FTP support.
- Switched default HTML parser to html5lib.
- Scripting:
- Added
handle_pre_response
callback hook.
- Added
- API:
- Fixed
ConnectionPool
max_host_count
argument not used. - Moved document scraping concerns from
WebProcessorSession
toProcessingRule
. - Renamed
SSLVerficationError
toSSLVerificationError
.
- Fixed
0.1001.2 (2014-10-25)¶
- Fixed ValueError crash on HTTP redirects with bad IPv6 URLs.
- Fixed AssertionError on link extraction with non-absolute URLs in “codebase” attribute.
- Fixed premature exit during an error fetching robots.txt.
- Fixed executable filename problem in setup.py for cx_Freeze builds.
0.1001.1 (2014-10-09)¶
- Fixed URLs with IPv6 addresses not including brackets when using them in host strings.
- Fixed AssertionError crash where PhantomJS crashed.
- Fixed database slowness over time.
- Cookies are now synchronized and shared with PhantomJS.
- Scripting:
- Fixed mismatched
queued_url and dequeued_url
causing negative values in a counter. Issue was caused by requeued items in “error” status.
- Fixed mismatched
0.1001 (2014-09-16)¶
- Fixed
--warc-move
option which had no effect. - Fixed JavaScript scraper to not accept URLs with backslashes.
- Fixed CSS scraper to not accept URLs longer than 500 characters.
- Fixed ValueError crash in Cache when two URLs are added sequentially at the same time due to bad LinkedList key comparison.
- Fixed crash formatting text when sizes reach terabytes.
- Fixed hang which may occur with lots of connection across many hostnames.
- Support for HTTP/HTTPS proxies but no HTTPS tunnelling support. Wpull will refuse to start without the insecure override option. Note that if authentication and a WARC file are enabled, the username and password are recorded into the WARC file.
- Improved database performance.
- Added `--ignore-fatal-errors` option.
- Added `--http-parser` option. You can now use html5lib as the HTML parser.
- Support for PyPy 2.3.1 running with Python 3.2 implementation.
- Consistent URL parsing among various Python versions.
- Added `--link-extractors` option.
- Added `--debug-manhole` option.
- API:
  - `document` and `scraper` were put into their own packages.
  - HTML parsing was put into `document.htmlparse` package.
  - `url.URLInfo` no longer supports normalizing URLs by percent decoding unreserved/safe characters.
- Scripting:
  - Dropped support for Scripting API version 1.
- Database schema:
  - Column `url_encoding` is removed from `urls` table.
0.1000 (2014-09-02)¶
- Dropped support for Python 2. Please file an issue if this is a problem.
- Fixed possible crash on empty content with deflate compression.
- Fixed document encoding detection on documents larger than 4096 bytes where an encoded character may have been truncated.
- Always percent-encode IRIs with UTF-8 to match de facto web browser implementation.
- HTTP headers are consistently decoded as Latin-1.
- Scripting API:
  - New `queued_url` and `dequeued_url` hooks contributed by mback2k.
- API:
  - Switched to Trollius instead of Tornado. Please use Trollius 1.0.2 alpha or greater.
  - Most of the internals related to the HTTP protocol were rewritten; as a result, major components are not backwards compatible and many changes were made. If you happen to be using Wpull’s API, please pin your requirements to `<0.1000` if you do not want to make a migration. Please file an issue if this is a problem.
0.36.4 (2014-08-07)¶
- Fixes crash when `--save-cookies` is used with non-ASCII cookies. Cookies with non-ASCII values are discarded.
- Fixed HTTP gzip compressed content not decompressed during chunked transfer of single bytes.
- Tornado 4.0 support.
- API:
  - Renamed: `cookie.CookieLimitsPolicy` to `DeFactoCookiePolicy`.
0.36.3 (2014-07-25)¶
- Improved performance of the `--database` option. SQLite now uses `synchronous=NORMAL` instead of `FULL`.
0.36.2 (2014-07-16)¶
- Fixed requirements.txt to use Tornado version less than 4.0.
0.36.1 (2014-07-16)¶
- Fixes bug where “FINISHED” message was not logged in WARC file meta log. Regression was introduced in version 0.35.
0.36 (2014-06-23)¶
- Works around `PhantomJSRPCTimedOut` errors.
- Adds `--phantomjs-exe` option.
- Supports extracting links from HTML `img` `srcset` attribute.
- API:
  - `Builder.build()` returns `Application` instead of `Engine`.
  - Callback hooks `exit_status` and `finishing_statistics` now registered on `Application` instead of `Engine`.
  - `network` module split into two modules: `bandwidth` and `dns`.
  - Adds `observer` module.
  - `phantomjs.PhantomJSRemote.page_event` renamed to `page_observer`.
0.35 (2014-06-16)¶
- Adds `--warc-move` option.
- Scripting:
  - Default scripting version is now 2.
- API:
  - Builder moved into new module builder.
  - Adds Application class intended for different UIs in the future.
  - `Resolver` `families` parameter renamed to `family`. It accepts values from the module `socket` or `PREFER_IPv4`/`PREFER_IPv6`.
  - Adds `HookableMixin`. This removes the use of messy subclassing for scripting hooks.
0.34.1 (2014-05-26)¶
- Fixes crash when a URL is incorrectly formatted by Wpull. (The incorrect formatting is not fixed yet however.)
0.34 (2014-05-06)¶
- Fixes file descriptor leak with `--phantomjs` and `--delete-after`.
- Fixes case where robots.txt file was stuck in a download loop if the server was offline.
- Fixes loading of cookies file from Wget. Cookie file header checks are disabled.
- Removes unneeded `--no-strong-robots` (superseded by `--no-strong-redirects`).
- Fixes `--no-phantomjs-snapshot` option not respected.
- More link extraction on HTML pages from elements with `onclick`, `onkeyX`, `onmouseX`, and `data-` attributes.
- Adds web-based debugging console with `--debug-console-port`.
0.33.2 (2014-04-29)¶
- Fixes links not resolved correctly when document includes `<base href="...">` element.
- Different proxy URL rewriting for PhantomJS option.
0.33.1 (2014-04-26)¶
- Fixes `--bind-address` option not working. The option was never functional since the first release.
- Fixes AttributeError crash when `--phantomjs` and `--X-script` options were used. Thanks to yipdw for reporting.
- Fixes `--warc-tempdir` to use the current directory by default.
- Fixes bad formatting and crash on links with malformed IPv6 addresses.
- Uses more rules for link extraction from JavaScript to reduce false positives.
0.33 (2014-04-21)¶
- Fixes invalid XHTML documents not properly extracted for links.
- Fixes crash on empty page.
- Support for extracting links from JavaScript segments and files.
- Doesn’t discard extracted links if document can only be parsed partially.
- API:
  - Moves `OrderedDefaultDict` from `util` to `collections`.
  - Moves `DeflateDecompressor`, `gzip_decompress` from `util` to `decompression`.
  - Moves `sleep`, `TimedOut`, `wait_future`, `AdjustableSemaphore` from `util` to `async`.
  - Moves `to_bytes`, `to_str`, `normalize_codec_name`, `detect_encoding`, `try_decoding`, `format_size`, `printable_bytes`, `coerce_str_to_ascii` from `util` to `string`.
  - Removes `extended` module.
- Scripting:
  - Adds new `wait_time()` callback hook function.
0.32.1 (2014-04-20)¶
- Fixes XHTML documents not properly extracted for links.
- If a server responds with content declared as Gzip, the content is checked to see if it starts with the Gzip magic number. This check avoids misreading text as Gzip streams.
0.32 (2014-04-17)¶
- Fixes crash when HTML meta refresh URL is empty.
- Fixes crash when decoding a document that is malformed later in the document. These invalid documents are not searched for links.
- Reduces CPU usage when `--debug` logging is not enabled.
- Better support for detecting and differentiating XHTML and XML documents.
- Fixes converting XHTML documents where it did not write XHTML syntax.
- RSS/Atom feed `link`, `url`, `icon` elements are searched for links.
- API:
  - `document.detect_response_encoding()` default peek argument is lowered to reduce hanging.
  - `document.BaseDocumentDetector` is now a base class for document type detection.
0.31 (2014-04-14)¶
- Fixes issue where an early `</html>` causes link discovery to be broken and converted documents to be missing elements.
- Fixes `--no-parent` which did not behave like Wget. This issue was noticeable with options such as `--span-hosts-allow linked-pages`.
- Fixes `--level` where page requisites were mistakenly not fetched if they exceeded the recursion level.
- Includes PhantomJS version string in WARC warcinfo record.
- User-agent string no longer includes Mozilla reference.
- Implements `--force-html` and `--base`.
- Cookies are now limited to approximately 4 kilobytes and a maximum of 50 cookies per domain.
- Document parsing is now streamed for better handling of large documents.
- Scripting:
  - Ability to set a scripting API version.
  - Scripting API version 2: Adds `record_info` argument to `handle_error` and `handle_response`.
- API:
  - WARCRecorder uses new parameter object WARCRecorderParams.
  - `document`, `scraper`, `converter` modules heavily modified to accommodate streaming readers.
  - `document.BaseDocumentReader.parse` was removed and replaced with `read_links`.
  - version.version_info available.
0.30 (2014-04-06)¶
- Fixes crash on SSL handshake if connection is broken.
- DNS entries are periodically removed from cache instead of held for long times.
- Experimental cx_freeze support.
- PhantomJS:
  - Fixes proxy errors with requests containing a body.
  - Fixes proxy errors with occasional FileNotFoundError.
  - Adds timeouts to calls.
  - Viewport size is now 1200 × 1920.
  - Default `--phantomjs-scroll` is now 10.
  - Scrolls to top of page before taking snapshot.
- API:
  - URL filters moved into urlfilter module.
  - Engine uses and exposes interface to AdjustableSemaphore for issue #93.
0.29 (2014-03-31)¶
- Fixes SSLVerficationError mistakenly raised during connection errors.
- `--span-hosts` no longer implicitly enabled on non-recursive downloads. This behavior is superseded by strong redirect logic. (Use `--span-hosts-allow` to guarantee fetching of page requisites.)
- Fixes URL query strings normalized with unnecessary percent-encoding escapes. Some servers do not handle percent-encoded URLs well.
- Fixes crash handling directory paths that may contain a filename or a filename that is a directory. This crash occurs when a URL like /blog and /blog/ exists. If a directory path contains a filename, the part of the directory path is suffixed with .d. If a filename is an existing directory, the filename is suffixed with .f.
- Fixes crash when URL’s hostname contains characters that decompose to dots.
- Fixes crash when HTML document declares encoding name unknown to Python.
- Fixes getting stuck in a loop if the server returns errors on robots.txt.
- Implements `--warc-dedup`.
- Implements `--ignore-length`.
- Implements `--output-document`.
- Implements `--http-compression`.
- Supports reading HTTP compression “deflate” encoding (both zlib and raw deflate).
- Scripting:
  - Adds `engine_run()` callback.
  - Exposes the instance factory.
- API:
  - connection: `Connection` arguments changed. Uses `ConnectionParams` as a parameter object. `HostConnectionPool` arguments also changed.
  - database: `URLDBRecord` renamed to `URL`. `URLStrDBRecord` renamed to `URLString`.
- Schema change:
  - New `visits` table.
0.28 (2014-03-27)¶
- Fixes crash when redirected to malformed URL.
- Fixes `--directory-prefix` not being honored.
- Fixes unnecessarily high CPU usage when determining the encoding of a document.
- Fixes crash (GeneratorExit exception) when exiting on Python 3.4.
- Uses new internal socket connection stream system.
- Updates bundled certificates (Tue Jan 28 09:38:07 2014).
- PhantomJS:
  - Fixes things not appearing in WARC files. This regression was introduced in 0.26 where PhantomJS’s disk cache was enabled. It is now disabled again.
  - Fixes HTTPS proxy URL rewriting where relative URLs were not properly rewritten.
  - Fixes proxy URL rewriting not working for localhost.
  - Fixes unwanted `Accept-Language` header picked up from environment. The value has been overridden to `*`.
  - Fixes `--header` options left out in requests.
- API:
  - New `iostream` module. `extended` module is deprecated.
0.27 (2014-03-23)¶
- Fixes URLs ignored (if any) on command line when `--input-file` is specified.
- Fixes crash when redirected to a URL that is not HTTP.
- Fixes crash if lxml does not recognize the document encoding name. Falls back to Latin-1 if lxml does not support the encoding after massaging the encoding name.
- Fixes crash on IPv6 addresses when using scripting or external API calls.
- Fixes speed shown as “0.0 B/s” instead of “– B/s” when speed cannot be calculated.
- Implements `--local-encoding`, `--remote-encoding`, `--no-iri`.
- Implements `--https-only`.
- Prints bandwidth speed statistics when exiting.
- PhantomJS:
  - Implements “smart scrolling” that avoids unnecessary scrolling.
  - Adds `--no-phantomjs-smart-scroll`.
- API:
  - `WebProcessorSession._parse_url()` renamed to `WebProcessorSession.parse_url()`.
0.26 (2014-03-16)¶
- Fixes crash when URLs like `http://example.com]` were encountered.
- Implements `--sitemaps`.
- Implements `--max-filename-length`.
- Implements `--span-hosts-allow` (experimental, see issues #61, #66).
- Query string items like `?a&b` are now preserved and no longer normalized to `?a=&b=`.
- API:
  - url.URLInfo.normalize() was removed since it was mainly used internally.
  - Added url.normalize() convenience function.
  - writer: safe_filename(), url_to_filename(), url_to_dir_path() were modified.
0.25 (2014-03-13)¶
- Fixes link converter not operating on the correct files when `.N` files were written.
- Fixes apparent hang when Wpull is almost finished on documents with many links.
- Previously, Wpull added all URLs to the database, causing processing overhead in the database. Now, only requisite URLs are added to the database.
- Implements `--restrict-file-names`.
- Implements `--quota`.
- Implements `--warc-max-size`. Like Wget, “max size” is not the maximum size of each WARC file but the threshold size that triggers a new file. Unlike Wget, `request` and `response` records are not split across WARC files.
- Implements `--content-on-error`.
- Supports recording scrolling actions in the WARC file when PhantomJS is enabled.
- Adds the `wpull` command to `bin/`.
- Database schema change: `filename` column was added.
- API:
  - converter.py: Converters no longer use PathNamer.
  - writer.py: `sanitize_file_parts()` was removed in favor of new `safe_filename()`. `save_document()` returns a filename.
  - WebProcessor now requires a root path to be specified.
  - WebProcessor initializer now takes “parameter objects”.
- Install requires new dependency: `namedlist`.
0.24 (2014-03-09)¶
- Fixes crash when document encoding could not be detected. Thanks to DopefishJustin for reporting.
- Fixes non-index files incorrectly saved where an extra directory was added as part of their path.
- URL path escaping is relaxed. This helps with servers that don’t handle percent-encoding correctly.
- `robots.txt` now bypasses the filters. Use `--no-strong-robots` to disable this behavior.
- Redirects implicitly span hosts. Use `--no-strong-redirects` to disable this behavior.
- Scripting:
  - `should_fetch()` info dict now contains `reason` as a key.
0.23.1 (2014-03-07)¶
- Important: Fixes issue where URLs were downloaded repeatedly.
0.23 (2014-03-07)¶
- Fixes incorrect logic in fetching robots.txt when it redirects to another URL.
- Fixes port number not included in the HTTP Host header.
- Fixes occasional `RuntimeError` when pressing CTRL+C.
- Fixes fetching URL paths containing dot segments. They are now resolved appropriately.
- Fixes ASCII progress bar occasionally not showing 100% when a download finished.
- Fixes crash and improves handling of unusual document encodings and settings.
- Improves handling of links with newlines and whitespace intermixed.
- Requires beautifulsoup4 as a dependency.
- API:
  - `util.detect_encoding()` arguments modified to accept only a single fallback and to accept `is_html`.
  - `document.get_encoding()` accepts `is_html` and `peek` arguments.
0.22.5 (2014-03-05)¶
- The ‘Refresh’ HTTP header is now scraped for URLs.
- When an error occurs during writing WARC files, the WARC file is truncated back to the last good state before crashing.
- Works around error “Reached maximum read buffer size” downloading on fast connections. Side effect is intensive CPU usage.
0.22.4 (2014-03-05)¶
- Fixes occasional error on chunked transfer encoding. Thanks to ivan for reporting.
- Fixes handling links with newlines found in HTML pages. Newlines are now stripped in links when scraping pages to better handle HTML soup.
0.22.3 (2014-03-02)¶
- Fixes another case of `AssertionError` on `url_item.is_processed` when robots.txt was enabled.
- Fixes crash if a malformed gzip response was received.
- Fixes `--span-hosts` to be implicitly enabled (as with `--no-robots`) if `--recursive` is not supplied. This behavior unconditionally allows downloading a single file without specifying any options. It is what a user intuitively expects.
0.22.2 (2014-03-01)¶
- Improves performance on database operations. CPU usage should be less intensive.
0.22.1 (2014-02-28)¶
- Fixes handling of “204 No Content” responses.
- Fixes `AssertionError` on `url_item.is_processed` when robots.txt was enabled.
- Fixes PhantomJS page scrolling to be consistent.
- Lengthens PhantomJS viewport to ensure lazy-load images are properly triggered.
- Lengthens PhantomJS paper size to reduce excessive fragmentation of blocks.
0.22 (2014-02-27)¶
- Implements `--phantomjs-scroll` and `--phantomjs-wait`.
- Implements saving HTML and PDF snapshots (including inside WARC file). Disable with `--no-phantomjs-snapshot`.
- API: Adds PhantomJSController.
0.21.1 (2014-02-27)¶
- Fixes missing dependencies and files in `setup.py`.
- For PhantomJS:
  - Fixes capturing HTTPS connections.
  - Fixes statistics counter.
  - Supports very basic scraping of HTML. See Usage section.
0.21 (2014-02-26)¶
- Fixes Request factory not used. This resolves issues where the User Agent was not set.
- Experimental PhantomJS support. It can be enabled with `--phantomjs`. See the Usage section in the documentation for more details.
- API changes:
  - The `http` module was split up into smaller modules: `http.client`, `http.connection`, `http.request`, `http.util`.
  - `ChunkedTransferStreamReader` was added as a reusable abstraction.
  - The `web` module was moved to `http.web`.
  - Added `proxy` module.
  - Added `phantomjs` module.
0.20 (2014-02-22)¶
- Implements `--no-dns-cache`, `--accept`, `--reject`.
- Scripting: Fixes `AttributeError` crash on `handle_error`.
- Another possible fix for issue #27.
0.19.2 (2014-02-18)¶
- Fixes crash if a non-HTTP URL was found during download.
- Lua scripting: Fixes booleans, coming from Wpull, mistakenly converted to integers on Python 2.
0.19.1 (2014-02-14)¶
- Fixes `--timestamping` functionality.
- Fixes `--timestamping` not checking `.orig` files.
- Fixes HTTP handling of responses which do not return content.
0.19 (2014-02-12)¶
- Fixes files not actually being written.
- Implements `--convert-links` and `--backup-converted`.
- API:
  - `HTMLScraper` functions were refactored to be class methods.
  - `ScrapedLink` was renamed to `LinkInfo`.
0.18.1 (2014-02-11)¶
- Fixes error when WARC but not CDX option is specified.
- Fixes closing of the SQLite database to avoid leaving temporary database files.
0.18 (2014-02-11)¶
- Implements `--no-warc-digests`, `--warc-cdx`.
- Improvements on reducing CPU usage.
- API: Engine and Processor interaction refactored to be asynchronous.
  - The Engine and Processor classes were modified significantly.
  - The Engine is no longer concerned with fetching requests.
  - Requests are handled within Processors. This will benefit future Processors by allowing them to make arbitrary requests during processing.
  - The `RedirectTracker` was moved to a new `web` module.
  - A `RichClient` is implemented. It handles robots.txt, cookies, and redirect concerns.
  - `WARCRecord` was moved into a new `warc` module.
0.17.3 (2014-02-07)¶
- Fixes ca-bundle file missing during install.
- Fixes AttributeError on `retry_dns_error`.
0.17.2 (2014-02-06)¶
- Another attempt to possibly fix #27.
- Implements cleaning inactive connections from the connection pool.
0.17.1 (2014-02-05)¶
- Another attempt to possibly fix #27.
- API: Refactored `ConnectionPool`. It now calls `put` on `HostConnectionPool` to avoid sharing a queue.
0.17 (2014-02-05)¶
- Implements cookie support.
- Fixes non-recursive downloads where robots.txt was checked unnecessarily.
- Possibly fixes issue #27 where HTTP workers get stuck.
0.16.1 (2014-02-05)¶
- Adds some documentation about stopping Wpull and a list of all options.
- API: `Builder` now exposes `Factory`.
- API: `WebProcessorSession` was refactored to not pass arguments through the initializer. It also now uses `DemuxDocumentScraper` and `DemuxURLFilter`.
0.16 (2014-02-04)¶
- Implements all the SSL options: `--certificate`, `--random-file`, `--egd-file`, `--secure-protocol`.
- Further improvement on database performance.
0.15.2 (2014-02-03)¶
- Improves database performance by reducing CPU usage.
0.15.1 (2014-02-03)¶
- Improves database performance by reducing disk reading.
0.15 (2014-02-02)¶
- Fixes robots.txt being fetched for every request.
- Scripts: Supports `replace` as part of `get_urls()`.
- Schema change: The database URL strings are normalized into a separate table. Using `--database` should now consume less disk space.
0.14.1 (2014-02-02)¶
- NameValueRecord now supports a `normalize_override` argument to specify how specific keys are cased, instead of the default title-case.
- Fixes WARC file’s field names to match the same case as hanzo’s warc-tools. warc-tools does not support case-insensitivity as required by the WARC specification in section 4. The WARC files generated by Wpull are conformant, however.
0.14 (2014-02-01)¶
- Database change: SQLAlchemy is now used for the URL Table.
- Scripts: `url_info['inline']` now returns a boolean, not an integer.
- Implements `--post-data` and `--post-file`.
- Scripts can now return `post_data` and `link_type` as part of `get_urls()`.
0.13 (2014-01-31)¶
- Supports reading HTTP responses with gzip content type.
0.12 (2014-01-31)¶
- No changes to program usage itself.
- More documentation.
- Major API changes due to refactoring:
  - `http.Body` moved to `conversation.Body`.
  - `document.HTTPScraper`, `document.CSSScraper` moved to `scraper` module.
  - `conversation` module now contains base classes for protocol elements.
  - `processor.WebProcessorSession` now uses keyword arguments.
  - `engine.Engine` requires `Statistics` argument.
0.11 (2014-01-29)¶
- Implements `--progress` which includes a progress bar indicator.
- Bumps up the HTTP connection buffer size to support fast connections.
0.10.9 (2014-01-28)¶
- Adds documentation. No program changes.
0.10.8 (2014-01-26)¶
- Improves robustness against bad HTTP protocol messages.
- Fixes various URL and IRI handling issues.
- Fixes `--input-file` to work as expected.
- Fixes command line arguments not working under Python 2.
0.10 (2014-01-23)¶
- Improves handling on URLs and document encodings.
- Implements `--ascii-print`.
- Fixes Lua scripting conversion of Python to Lua object types.
0.9 (2014-01-21)¶
- Adds basic SSL options.
0.8 (2014-01-21)¶
- Supports Python and Lua scripting via `--python-script` and `--lua-script`.
0.7 (2014-01-18)¶
- Fixes robots.txt support.
0.6 (2014-01-17)¶
- Implements `--warc-append`, `--concurrent`.
- `--read-timeout` default is 900 seconds.
0.5 (2014-01-17)¶
- Implements `--no-http-keepalive`, `--rotate-dns`.
. - Adds basic support for HTTPS.
0.4 (2014-01-15)¶
- Implements `--continue`, `--no-clobber`, `--timestamping`.
0.3.2 (2014-01-07)¶
- Fixes database rows not saved correctly.
0.3 (2014-01-07)¶
- Implements `--hostnames` and `--exclude-hostnames`.
0.2 (2014-01-06)¶
- Implements `--header` option.
- Various 3to2 bug fixes.
0.1 (2014-01-05)¶
- The first usable release.
WARC Specification¶
Additional de-facto and custom extensions to the WARC standard.
Wpull follows the specifications in the latest draft of ISO 28500.
FTP¶
FTP recording follows Heritrix specifications.
Control Conversation¶
The Control Conversation is recorded as:
- WARC-Type: `metadata`
- Content-Type: `text/x-ftp-control-conversation`
- WARC-Target-URI: a URL. For example, `ftp://anonymous@example.com/treasure.txt`
- WARC-IP-Address: an IPv4 address with port, or an IPv6 address with brackets and port
The resource is formatted as follows:
- Events are indented with an ASCII asterisk and space.
- Requests are indented with an ASCII greater-than and space.
- Responses are indented with an ASCII less-than and space.
The document encoding is UTF-8.
Changed in version 1.2a1: The document encoding previously used Latin-1.
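The prefix rules above can be sketched in a few lines. This is a hypothetical helper, not part of Wpull’s code; only the three prefixes come from this specification:

```python
# Hypothetical sketch (not Wpull's implementation): serialize an FTP
# control conversation using the prefixes described above.
PREFIXES = {
    'event': '* ',      # events: ASCII asterisk and space
    'request': '> ',    # requests: ASCII greater-than and space
    'response': '< ',   # responses: ASCII less-than and space
}

def format_conversation(entries):
    """Return the record body text from (kind, line) pairs."""
    return '\n'.join(PREFIXES[kind] + line for kind, line in entries)

text = format_conversation([
    ('event', 'Connected to example.com'),
    ('request', 'USER anonymous'),
    ('response', '230 Login successful.'),
])
```

The resulting text would then be written to the record body encoded as UTF-8.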
Response data¶
The response data is recorded as:
- WARC-Type: `resource`
- WARC-Target-URI: a URL. For example, `ftp://anonymous@example.com/treasure.txt`
- WARC-Concurrent-To: a WARC Record ID of the Control Conversation
PhantomJS¶
Snapshot¶
A PhantomJS Snapshot represents the state of the DOM at the time of capture.
A Snapshot is recorded as:
- WARC-Type: `resource`
- WARC-Target-URI: `urn:X-wpull:snapshot?url=URLHERE` where `URLHERE` is a percent-encoded URL of the PhantomJS page.
- Content-Type: one of `application/pdf`, `text/html`, `image/png`
- WARC-Concurrent-To: a WARC Record ID of a Snapshot Action Metadata.
Snapshot Action Metadata¶
An Action Metadata is a log of steps performed before a Snapshot is taken.
It is recorded as:
- WARC-Type: `metadata`
- Content-Type: `application/json`
- WARC-Target-URI: `urn:X-wpull:snapshot?url=URLHERE` where `URLHERE` is a percent-encoded URL of the PhantomJS page.
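The `urn:X-wpull:snapshot` target URI used by both record types above can be sketched as follows. The helper name is hypothetical; only the URI layout and the percent-encoding requirement come from this document:

```python
from urllib.parse import quote

def snapshot_uri(page_url):
    # Hypothetical helper: percent-encode the PhantomJS page URL and
    # embed it in the urn:X-wpull:snapshot target URI described above.
    return 'urn:X-wpull:snapshot?url=' + quote(page_url, safe='')

uri = snapshot_uri('http://example.com/page?id=1')
```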
Wpull Metadata¶
Log¶
Wpull’s log is recorded as:
- WARC-Type: `resource`
- Content-Type: `text/plain`
- WARC-Target-URI: `urn:X-wpull:log`
The document encoding is UTF-8.
youtube-dl¶
The JSON file is recorded as:
- WARC-Type: `metadata`
- Content-Type: `application/vnd.youtube-dl_formats+json`
- WARC-Target-URI: `metadata://AUTHORITY_AND_RESOURCE` where `AUTHORITY_AND_RESOURCE` is the hierarchical part, query, and fragment of the URL passed to youtube-dl. In other words, the URI is the URL with the scheme replaced with `metadata`.
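The scheme substitution can be sketched as follows. The helper name is hypothetical; only the `metadata://` URI layout comes from this document:

```python
from urllib.parse import urlsplit, urlunsplit

def metadata_uri(url):
    # Hypothetical helper: keep authority, path, query, and fragment,
    # but replace the scheme with "metadata" as described above.
    parts = urlsplit(url)
    return urlunsplit(
        ('metadata', parts.netloc, parts.path, parts.query, parts.fragment))

uri = metadata_uri('https://video.example.com/watch?v=abc123')
```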