Welcome to Wpull’s documentation!¶
Homepage: https://github.com/chfoo/wpull
Contents:
Introduction¶
Wpull is a Wget-compatible (or remake/clone/replacement/alternative) web downloader and crawler.
Notable Features:
- Written in Python: lightweight, modifiable, robust, & scriptable
- Graceful stopping; on-disk database resume
- PhantomJS & youtube-dl integration (experimental)
Wpull is designed to be (almost) a drop-in replacement for Wget with minimal changes to options. It is designed for running much larger crawls rather than speedily downloading a single file.
Wpull’s behavior is not an exact duplicate of Wget’s behavior. As such, you should not expect exact output and operation out of Wpull. However, it aims to be a very useful alternative as its source code can be easily modified to fix, change, or extend its behaviors.
For instructions, read on to the next sections. Confused? Check out the Frequently Asked Questions.
Installation¶
Requirements¶
Wpull requires the following:
- Python 3.4.3 or greater
- Tornado 4.0 or greater
- html5lib
- Or lxml for faster, but less robust, HTML parsing
- chardet
- Or cchardet for a faster version of chardet
- SQLAlchemy 0.9 or greater
The following are optional:
- psutil for monitoring disk space
- Manhole for a REPL debugging socket
- PhantomJS 1.9.8 or 2.1 for capturing interactive JavaScript pages
- youtube-dl for downloading complex video streaming sites
For installing Wpull, it is recommended to use the pip installer.
Wpull is officially supported in a Unix-like environment.
Automatic Install¶
Once you have installed Python, lxml, and pip, install Wpull with dependencies automatically from PyPI:
pip3 install wpull
Tip
Adding the --upgrade option will upgrade Wpull to the latest
release. Use --no-dependencies to only upgrade Wpull.
Adding the --user option will install Wpull into your home
directory.
Automatic install is usually the best option. However, there may be outstanding fixes to bugs that are not yet released to PyPI. In this case, use the manual install.
Manual Install¶
Install the dependencies known to work with Wpull:
pip3 install -r https://raw.githubusercontent.com/chfoo/wpull/master/requirements.txt
Install Wpull from GitHub:
pip3 install git+https://github.com/chfoo/wpull.git#egg=wpull
Tip
Using git+https://github.com/chfoo/wpull.git@develop#egg=wpull
as the path will install Wpull’s develop branch.
psutil¶
psutil is required for the disk and memory monitoring options but may not be available. To install:
pip3 install psutil
Pre-built Binaries¶
Wpull has pre-built binaries located at https://launchpad.net/wpull/+download. These are unsupported and may not be up to date.
Caveats¶
Python¶
Please obtain the latest Python release from http://python.org/download/ or your package manager. It is recommended to use Python 3.4.3 or greater. Versions 3.4 and 3.5 are officially supported.
Python 2 and PyPy are not supported.
lxml¶
It is recommended that lxml is obtained through an installer
or pre-built package. Windows packages are provided on
https://pypi.python.org/pypi/lxml. Debian/Ubuntu users
should install python3-lxml. For more information, see
http://lxml.de/installation.html.
pip¶
If pip is not installed on your system yet, please follow the instructions at http://www.pip-installer.org/en/latest/installing.html to install pip. Note for Linux users, ensure you are executing the appropriate Python version when installing pip.
PhantomJS (Optional)¶
It is recommended to download a prebuilt binary build from http://phantomjs.org/download.html.
Usage¶
Intro¶
Wpull is a command line oriented program much like Wget. It is non-interactive and requires all options to be specified on startup. If you are not familiar with Wget, please see the Wikipedia article on Wget.
Example Commands¶
To download the About page of Google.com:
wpull google.com/about
To archive a website:
wpull billy.blogsite.example \
--warc-file blogsite-billy \
--no-check-certificate \
    --no-robots --user-agent "InconspicuousWebBrowser/1.0" \
--wait 0.5 --random-wait --waitretry 600 \
--page-requisites --recursive --level inf \
--span-hosts-allow linked-pages,page-requisites \
--escaped-fragment --strip-session-id \
--sitemaps \
--reject-regex "/login\.php" \
--tries 3 --retry-connrefused --retry-dns-error \
--timeout 60 --session-timeout 21600 \
--delete-after --database blogsite-billy.db \
--quiet --output-file blogsite-billy.log
Wpull can also be invoked using:
python3 -m wpull
Stopping & Resuming¶
To gracefully stop Wpull, press CTRL+C (or send SIGINT). Wpull will quit once the current download has finished. To stop immediately, press CTRL+C again (or send SIGTERM).
If you have used the --database option, Wpull can reuse the
existing database for resuming crawls. This behavior is different than
--continue. Resuming with --continue is intended for resuming
partially downloaded files while --database is intended for resuming
partial crawls.
To resume a crawl provided you have used --database, simply reuse
the same command options from the previous run. This will maintain the
same behavior as the previous run. You may also tweak the options, for
example, limit the recursion depth.
Note
When resuming downloads with --warc-file and
--database, Wpull will overwrite the WARC file by default. This
occurs because Wpull simply maintains a list of URLs that are
fetched and not fetched. You should either rename the existing
file manually, append to it with --warc-append, or move completed
files out of the way with --warc-move.
Proxied Services¶
Wpull is able to use an HTTP proxy server to capture traffic from third-party programs such as PhantomJS.
The requests will go through the proxy to Wpull’s HTTP client (which can be recorded with --warc-file).
Warning
Wpull uses the HTTP proxy insecurely on localhost.
It is possible for another user, on the same machine as Wpull, to send bogus requests to the HTTP proxy. Wpull, however, does not expose the HTTP proxy to the outside network by default.
It is not possible to use the proxy standalone at this time.
PhantomJS Integration¶
PhantomJS support is currently experimental.
--phantomjs will enable PhantomJS integration.
If an HTML document is encountered, Wpull will open the URL in PhantomJS. After the page is loaded, Wpull will try to scroll the page as specified by --phantomjs-scroll. Then, the HTML DOM source is scraped for URLs as normal. HTML and PDF snapshots are taken by default.
Currently, Wpull will not do anything else to manipulate the page such as clicking on links. As a consequence, Wpull with PhantomJS is not a complete solution for dynamic web pages yet!
Storing console logs and alert messages inside the WARC file is not yet supported.
youtube-dl Integration¶
youtube-dl support is currently experimental.
--youtube-dl will enable youtube-dl integration.
If an HTML document is encountered, Wpull will run youtube-dl on the URL. Wpull invokes it with the options for downloading subtitles and thumbnails. Other options are left at their defaults, which may not grab the best possible quality. For example, youtube-dl may not grab the highest quality stream when it is not a simple video file.
It is not recommended to use recursion because it may fetch redundant amounts of data.
Storing manifests, metadata, or converted files inside the WARC file is not yet supported.
Options¶
Wget-compatible web downloader and crawler.
usage: wpull [-h] [-V] [--plugin-script FILE] [--plugin-args PLUGIN_ARGS]
[--database FILE | --database-uri URI] [--concurrent N]
[--debug-console-port PORT] [--debug-manhole]
[--ignore-fatal-errors] [--monitor-disk MONITOR_DISK]
[--monitor-memory MONITOR_MEMORY] [-o FILE | -a FILE]
[-d | -v | -nv | -q | -qq] [--ascii-print]
[--report-speed TYPE={bits}] [-i FILE] [-F] [-B URL]
[--http-proxy HTTP_PROXY] [--https-proxy HTTPS_PROXY]
[--proxy-user USER] [--proxy-password PASS] [--no-proxy]
[--proxy-domains LIST] [--proxy-exclude-domains LIST]
[--proxy-hostnames LIST] [--proxy-exclude-hostnames LIST]
[-t NUMBER] [--retry-connrefused] [--retry-dns-error] [-O FILE]
[-nc] [-c] [--progress TYPE={bar,dot,none}] [-N]
[--no-use-server-timestamps] [-S] [-T SECONDS]
[--dns-timeout SECS] [--connect-timeout SECS]
[--read-timeout SECS] [--session-timeout SECS] [-w SECONDS]
[--waitretry SECONDS] [--random-wait] [-Q NUMBER]
[--bind-address ADDRESS] [--limit-rate RATE] [--no-dns-cache]
[--rotate-dns] [--no-skip-getaddrinfo]
[--restrict-file-names MODES=<ascii,lower,nocontrol,unix,upper,windows>]
[-4 | -6 | --prefer-family FAMILY={IPv4,IPv6,none}] [--user USER]
[--password PASSWORD] [--no-iri] [--local-encoding ENC]
[--remote-encoding ENC] [--max-filename-length NUMBER] [-nd | -x]
[-nH] [--protocol-directories] [-P PREFIX] [--cut-dirs NUMBER]
[--http-user HTTP_USER] [--http-password HTTP_PASSWORD]
[--no-cache] [--default-page NAME] [-E] [--ignore-length]
[--header STRING] [--max-redirect NUMBER] [--referer URL]
[--save-headers] [-U AGENT] [--no-robots] [--no-http-keep-alive]
[--no-cookies] [--load-cookies FILE] [--save-cookies FILE]
[--keep-session-cookies] [--post-data STRING | --post-file FILE]
[--content-disposition] [--content-on-error] [--http-compression]
[--html-parser {html5lib,libxml2-lxml}]
[--link-extractors <css,html,javascript>] [--escaped-fragment]
[--strip-session-id]
[--secure-protocol PR={SSLv3,TLSv1,TLSv1.1,TLSv1.2,auto}]
[--https-only] [--no-check-certificate] [--no-strong-crypto]
[--certificate FILE] [--certificate-type TYPE={PEM}]
[--private-key FILE] [--private-key-type TYPE={PEM}]
[--ca-certificate FILE] [--ca-directory DIR]
[--no-use-internal-ca-certs] [--random-file FILE]
[--edg-file FILE] [--ftp-user USER] [--ftp-password PASS]
[--no-remove-listing] [--no-glob] [--preserve-permissions]
[--retr-symlinks [{0,1,no,off,on,yes}]] [--warc-file FILENAME]
[--warc-append] [--warc-header STRING] [--warc-max-size NUMBER]
[--warc-move DIRECTORY] [--warc-cdx] [--warc-dedup FILE]
[--no-warc-compression] [--no-warc-digests] [--no-warc-keep-log]
[--warc-tempdir DIRECTORY] [-r] [-l NUMBER] [--delete-after] [-k]
[-K] [-p] [--page-requisites-level NUMBER] [--sitemaps] [-A LIST]
[-R LIST] [--accept-regex REGEX] [--reject-regex REGEX]
[--regex-type TYPE={pcre}] [-D LIST] [--exclude-domains LIST]
[--hostnames LIST] [--exclude-hostnames LIST] [--follow-ftp]
[--follow-tags LIST] [--ignore-tags LIST]
[-H | --span-hosts-allow LIST=<linked-pages,page-requisites>]
[-L] [-I LIST] [--trust-server-names] [-X LIST] [-np]
[--no-strong-redirects] [--proxy-server]
[--proxy-server-address ADDRESS] [--proxy-server-port PORT]
[--phantomjs] [--phantomjs-exe PATH]
[--phantomjs-max-time PHANTOMJS_MAX_TIME]
[--phantomjs-scroll NUM] [--phantomjs-wait SEC]
[--no-phantomjs-snapshot] [--no-phantomjs-smart-scroll]
[--youtube-dl] [--youtube-dl-exe PATH]
[URL [URL ...]]
Positional arguments:
- urls: the URL to be downloaded

Options:
- -V, --version: show program's version number and exit
- --plugin-script: load plugin script from FILE
- --plugin-args: arguments for the plugin
- --database: save database tables into FILE instead of memory
- --database-uri: save database tables at SQLAlchemy URI instead of memory
- --concurrent: run at most N downloads at the same time
- --debug-console-port: run a web debug console at given port number
- --debug-manhole: install Manhole debugging socket
- --ignore-fatal-errors: ignore all internal fatal exception errors
- --monitor-disk: pause if minimum free disk space is exceeded
- --monitor-memory: pause if minimum free memory is exceeded
- -o, --output-file: write program messages to FILE
- -a, --append-output: append program messages to FILE
- -d, --debug: print debugging messages
- -v, --verbose: print informative program messages and detailed progress
- -nv, --no-verbose: print informative program messages and errors
- -q, --quiet: print program error messages
- -qq, --very-quiet: do not print program messages unless critical
- --ascii-print: print program messages in ASCII only
- --report-speed: print speed in bits only instead of human formatted units (possible choices: bits)
- -i, --input-file: download URLs listed in FILE
- -F, --force-html: read URL input files as HTML files
- -B, --base: resolves input relative URLs to URL
- --http-proxy: HTTP proxy for HTTP requests
- --https-proxy: HTTP proxy for HTTPS requests
- --proxy-user: username for proxy "basic" authentication
- --proxy-password: password for proxy "basic" authentication
- --no-proxy: disable proxy support
- --proxy-domains: use proxy only from LIST of hostname suffixes
- --proxy-exclude-domains: don't use proxy only from LIST of hostname suffixes
- --proxy-hostnames: use proxy only from LIST of hostnames
- --proxy-exclude-hostnames: don't use proxy only from LIST of hostnames
- -t, --tries: try NUMBER of times on transient errors
- --retry-connrefused: retry even if the server does not accept connections
- --retry-dns-error: retry even if DNS fails to resolve hostname
- -O, --output-document: stream every document into FILE
- -nc, --no-clobber: don't use anti-clobbering filenames
- -c, --continue: resume downloading a partially-downloaded file
- --progress: choose the type of progress indicator (possible choices: dot, bar, none)
- -N, --timestamping: only download files that are newer than local files
- --no-use-server-timestamps: don't set the last-modified time on files
- -S, --server-response: print the protocol responses from the server
- -T, --timeout: set DNS, connect, read timeout options to SECONDS
- --dns-timeout: timeout after SECS seconds for DNS requests
- --connect-timeout: timeout after SECS seconds for connection requests
- --read-timeout: timeout after SECS seconds for reading requests
- --session-timeout: timeout after SECS seconds for downloading files
- -w, --wait: wait SECONDS seconds between requests
- --waitretry: wait up to SECONDS seconds on retries
- --random-wait: randomly perturb the time between requests
- -Q, --quota: stop after downloading NUMBER bytes
- --bind-address: bind to ADDRESS on the local host
- --limit-rate: limit download bandwidth to RATE
- --no-dns-cache: disable caching of DNS lookups
- --rotate-dns: use different resolved IP addresses on requests
- --no-skip-getaddrinfo: always use the OS's name resolver interface
- --restrict-file-names: list of safe filename modes to use (possible choices: unix, nocontrol, ascii, windows, lower, upper)
- -4, --inet4-only: connect to IPv4 addresses only
- -6, --inet6-only: connect to IPv6 addresses only
- --prefer-family: prefer to connect to FAMILY IP addresses (possible choices: none, IPv6, IPv4)
- --user: username for both FTP and HTTP authentication
- --password: password for both FTP and HTTP authentication
- --no-iri: use ASCII encoding only
- --local-encoding: use ENC as the encoding of input files and options
- --remote-encoding: force decoding documents using codec ENC
- --max-filename-length: limit filename length to NUMBER characters
- -nd, --no-directories: don't create directories
- -x, --force-directories: always create directories
- -nH, --no-host-directories: don't create directories for hostnames
- --protocol-directories: create directories for URL schemes
- -P, --directory-prefix: save everything under the directory PREFIX
- --cut-dirs: don't make NUMBER of leading directories
- --http-user: username for HTTP authentication
- --http-password: password for HTTP authentication
- --no-cache: request server to not use cached version of files
- --default-page: use NAME as index page if not known
- -E, --adjust-extension: append HTML or CSS file extension if needed
- --ignore-length: ignore any Content-Length provided by the server
- --header: adds STRING to the HTTP header
- --max-redirect: follow only up to NUMBER document redirects
- --referer: always use URL as the referrer
- --save-headers: include server header responses in files
- -U, --user-agent: use AGENT instead of Wpull's user agent
- --no-robots: ignore robots.txt directives
- --no-http-keep-alive: disable persistent HTTP connections
- --no-cookies: disables HTTP cookie support
- --load-cookies: load Mozilla cookies.txt from FILE
- --save-cookies: save Mozilla cookies.txt to FILE
- --keep-session-cookies: include session cookies when saving cookies to file
- --post-data: use POST for all requests with query STRING
- --post-file: use POST for all requests with query in FILE
- --content-disposition: use filename given in Content-Disposition header
- --content-on-error: keep error pages
- --http-compression: request servers to use HTTP compression
- --html-parser: select HTML parsing library and strategy (possible choices: libxml2-lxml, html5lib)
- --link-extractors: specify which link extractors to use (possible choices: css, html, javascript)
- --escaped-fragment: rewrite links with hash fragments to escaped fragments
- --strip-session-id: remove session ID tokens from links
- --secure-protocol: specify the version of the SSL protocol to use (possible choices: SSLv3, TLSv1, TLSv1.1, TLSv1.2, auto)
- --https-only: download only HTTPS URLs
- --no-check-certificate: don't validate SSL server certificates
- --no-strong-crypto: don't use secure protocols/ciphers
- --certificate: use FILE containing the local client certificate
- --certificate-type: Undocumented (possible choices: PEM)
- --private-key: use FILE containing the local client private key
- --private-key-type: Undocumented (possible choices: PEM)
- --ca-certificate: load and use CA certificate bundle from FILE
- --ca-directory: load and use CA certificates from DIR
- --no-use-internal-ca-certs: don't use CA certificates included with Wpull
- --random-file: use data from FILE to seed the SSL PRNG
- --edg-file: connect to entropy gathering daemon using socket FILE
- --ftp-user: username for FTP login
- --ftp-password: password for FTP login
- --no-remove-listing: keep directory file listings
- --no-glob: don't use filename glob patterns on FTP URLs
- --preserve-permissions: apply server's Unix file permissions on downloaded files
- --retr-symlinks: if disabled, preserve symlinks and run with security risks (possible choices: yes, on, 1, off, no, 0)
- --warc-file: save WARC file to filename prefixed with FILENAME
- --warc-append: append instead of overwrite the output WARC file
- --warc-header: include STRING in WARC file metadata
- --warc-max-size: write sequential WARC files sized about NUMBER bytes
- --warc-move: move WARC files to DIRECTORY as they complete
- --warc-cdx: write CDX file along with the WARC file
- --warc-dedup: write revisit records using digests in FILE
- --no-warc-compression: do not compress the WARC file
- --no-warc-digests: do not compute and save SHA1 hash digests
- --no-warc-keep-log: do not save a log into the WARC file
- --warc-tempdir: use temporary DIRECTORY for preparing WARC files
- -r, --recursive: follow links and download them
- -l, --level: limit recursion depth to NUMBER
- --delete-after: download files temporarily and delete them after
- -k, --convert-links: rewrite links in files that point to local files
- -K, --backup-converted: save original files before converting their links
- -p, --page-requisites: download objects embedded in pages
- --page-requisites-level: limit page-requisites recursion depth to NUMBER
- --sitemaps: download Sitemaps to discover more links
- -A, --accept: download only files with suffix in LIST
- -R, --reject: don't download files with suffix in LIST
- --accept-regex: download only URLs matching REGEX
- --reject-regex: don't download URLs matching REGEX
- --regex-type: use regex TYPE (possible choices: pcre)
- -D, --domains: download only from LIST of hostname suffixes
- --exclude-domains: don't download from LIST of hostname suffixes
- --hostnames: download only from LIST of hostnames
- --exclude-hostnames: don't download from LIST of hostnames
- --follow-ftp: follow links to FTP sites
- --follow-tags: follow only links contained in LIST of HTML tags
- --ignore-tags: don't follow links contained in LIST of HTML tags
- -H, --span-hosts: follow links and page requisites to other hostnames
- --span-hosts-allow: selectively span hosts for resource types in LIST (possible choices: page-requisites, linked-pages)
- -L, --relative: follow only relative links
- -I, --include-directories: download only paths in LIST
- --trust-server-names: use the last given URL for filename during redirects
- -X, --exclude-directories: don't download paths in LIST
- -np, --no-parent: don't follow to parent directories on URL path
- --no-strong-redirects: don't implicitly allow span hosts for redirects
- --proxy-server: run HTTP proxy server for capturing requests
- --proxy-server-address: bind the proxy server to ADDRESS
- --proxy-server-port: bind the proxy server port to PORT
- --phantomjs: use PhantomJS for loading dynamic pages
- --phantomjs-exe: path of PhantomJS executable
- --phantomjs-max-time: maximum duration of PhantomJS session
- --phantomjs-scroll: scroll the page up to NUM times
- --phantomjs-wait: wait SEC seconds between page interactions
- --no-phantomjs-snapshot: don't take dynamic page snapshots
- --no-phantomjs-smart-scroll: always scroll the page to maximum scroll count option
- --youtube-dl: use youtube-dl for downloading videos
- --youtube-dl-exe: path of youtube-dl executable
Defaults may differ depending on the operating system. Use --help to see them.
This is only a programmatically generated listing from the program. In most cases, you can follow Wget's documentation for these options. Wpull follows Wget's behavior, so please check Wget's online documentation and resources before asking questions.
Differences between Wpull and Wget¶
In most cases, Wpull can be substituted with Wget easily. However, some options may not be implemented yet. This section describes the reasons for option differences.
Missing in Wpull¶
- --background
- --execute
- --config
- --spider
- --ignore-case
- --ask-password
- --unlink
- --method
- --body-data
- --body-file
- --auth-no-challenge: Temporarily on by default, but specifying the option is not yet available. Digest authentication is not yet supported.
- --no-passive-ftp
- --mirror
- --strict-comments: No plans for support of this option.
- --regex-type=posix: No plans to support posix regex.
- Features greater than Wget 1.15.
Missing in Wget¶
- --plugin-script: This provides scripting hooks.
- --plugin-args
- --database: Enables the use of the on-disk database.
- --database-uri
- --concurrent: Allows changing the number of downloads that happen at once.
- --debug-console-port
- --debug-manhole
- --ignore-fatal-errors
- --monitor-disk: Avoids filling the disk.
- --monitor-memory
- --very-quiet
- --ascii-print: Force replaces Unicode text with escaped values for environments that are ASCII only.
- --http-proxy
- --https-proxy
- --proxy-domains
- --proxy-exclude-domains
- --proxy-hostnames
- --proxy-exclude-hostnames
- --retry-dns-error: Wget considers DNS errors as non-recoverable.
- --session-timeout: Abort downloading infinite MP3 streams.
- --no-skip-getaddrinfo
- --no-robots: Wpull is designed for archiving.
- --http-compression (gzip, deflate, & raw deflate)
- --html-parser: HTML parsing libraries have many trade-offs. Pick any two: small, fast, reliable.
- --link-extractors
- --escaped-fragment: Try to force HTML rendering instead of JavaScript.
- --strip-session-id
- --no-strong-crypto
- --no-use-internal-ca-certs
- --warc-append
- --warc-move: Move WARC files out of the way for resuming a crashed crawl.
- --page-requisites-level: Prevent infinite downloading of misconfigured server resources such as HTML served under an image.
- --sitemaps: Discover more URLs.
- --hostnames: Wget simply matches the endings when using --domains instead of matching each part of the hostname.
- --exclude-hostnames
- --span-hosts-allow: Allow fetching things such as images hosted on another domain.
- --no-strong-redirects
- --proxy-server
- --proxy-server-address
- --proxy-server-port
- --phantomjs
- --phantomjs-exe
- --phantomjs-max-time
- --phantomjs-scroll
- --phantomjs-wait
- --no-phantomjs-snapshot
- --no-phantomjs-smart-scroll
- --youtube-dl
- --youtube-dl-exe
Help¶
Frequently Asked Questions¶
What does it mean by “Wget-compatible”?¶
It means that Wpull behaves similarly to Wget, but the internal machinery that powers Wpull is completely different from Wget.
What advantages does Wpull offer over Wget?¶
The motivation for the development of Wpull is to find a replacement for Wget that does not store URLs in memory and is scriptable.
Wpull supports an on-disk database so that memory requirements remain constant. Wget stores URLs only in memory, so it will eventually run out of memory if you crawl millions of URLs at once.
Another motivation is to provide hooks that accept/reject URLs during the crawl.
What advantages does Wget offer over Wpull?¶
Wget is much more mature and stable. With many developers working on Wget, bug fixes and features arrive faster.
Wget is also written in C, which handles text much faster. Wpull is written in Python, which was not designed for blazing-fast data processing, so Wpull can be slow when processing large documents.
How can I change things while it is running? / Is there a GUI or web interface to make things easier?¶
Wpull does not offer a user-friendly interface for making changes while it runs at this time. However, please check out https://github.com/ludios/grab-site, which is a web interface built on top of Wpull.
Wpull is giving an error or not performing correctly.¶
Check that you have the options correct. In most cases, it is a misunderstanding of Wget options.
Otherwise if Wpull is not doing what you want, please visit the issue tracker and see if your issue is there. If not, please inform the developers by creating a new issue.
When you open a new issue, GitHub provides a link to the guidelines document. Please read it to learn how to file a good bug report.
How can I help the development of Wpull? What are the development goals?¶
Please visit the [GitHub repository](https://github.com/chfoo/wpull). From there, you can take a look at:
- The Contributing file for specific instructions on how to help
- The issue tracker for current bugs and features
- The Wiki for the roadmap of the project such as goals and statuses
- And the code, of course
How can I chat or ask a question?¶
For chatting and quick questions, please visit the “unofficial” IRC channel: #archiveteam-bs on EFNet.
Alternatively if the discussion is lengthy, please use the issue tracker as described above. As a courtesy, if your question is answered on the issue tracker, please close the issue to mark your question as solved.
We highly prefer that you use IRC or the issue tracker. But email is also available: chris.foo@gmail.com
Plugin Scripting Hooks¶
Wpull’s scripting support is modelled after alard’s Wget with Lua hooks.
Scripts are installed using the YAPSY plugin architecture. To create your plugin
script, subclass wpull.application.plugin.WpullPlugin and
load it with the --plugin-script option.
The plugin interface provides two types of callbacks: hooks and events.
Hook¶
Hooks change the behavior of the program. When a callback is
registered to a hook, it is required to provide a return value,
typically one of wpull.application.hook.Actions. Only
one callback may be registered to a hook.
To register your callback, decorate your callback with
wpull.application.plugin.hook().
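A minimal sketch of registering a hook callback (the class and callback names are placeholders; the handle_response signature and the Actions return value follow the full example later in this section):

from wpull.application.hook import Actions
from wpull.application.plugin import WpullPlugin, PluginFunctions, hook


class MyHookPlugin(WpullPlugin):
    @hook(PluginFunctions.handle_response)
    def my_response_hook(self, item_session):
        # Returning an Actions value tells Wpull how to proceed with this item.
        return Actions.NORMAL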
Event¶
Events are points in the program that, when reached, notify the registered listeners.
To register your callback, decorate your callback with
wpull.application.plugin.event().
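A minimal sketch of registering an event listener (names are placeholders; the dequeued_url event and its arguments are listed under Interfaces below, and the url attribute on URLInfo is an assumption for illustration):

from wpull.application.plugin import WpullPlugin, PluginFunctions, event


class MyEventPlugin(WpullPlugin):
    @event(PluginFunctions.dequeued_url)
    def my_dequeued_listener(self, url_info, record_info):
        # Called after a URL is taken off the queue; event callbacks return nothing.
        print('Dequeued:', url_info.url)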
Interfaces¶
The global hooks and events constants are located at
wpull.application.plugin.PluginFunctions.
- PluginFunctions.accept_url (hook). Interface: FetchRule.plugin_accept_url
- PluginFunctions.dequeued_url (event). Interface: URLTableHookWrapper.dequeued_url
- PluginFunctions.exit_status (hook). Interface: AppStopTask.plugin_exit_status
- PluginFunctions.finishing_statistics (event). Interface: StatsStopTask.plugin_finishing_statistics
- PluginFunctions.get_urls (event). Interface: ProcessingRule.plugin_get_urls
- PluginFunctions.handle_error (hook). Interface: ResultRule.plugin_handle_error
- PluginFunctions.handle_pre_response (hook). Interface: ResultRule.plugin_handle_pre_response
- PluginFunctions.handle_response (hook). Interface: ResultRule.plugin_handle_response
- PluginFunctions.queued_url (event). Interface: URLTableHookWrapper.queued_url
- PluginFunctions.resolve_dns (hook). Interface: Resolver.resolve_dns
- PluginFunctions.resolve_dns_result (event). Interface: Resolver.resolve_dns_result
- PluginFunctions.wait_time (hook). Interface: ResultRule.plugin_wait_time
Example¶
Here is an example Python plugin script. It:
- Prints hello on start up
- Refuses to download anything with the word “dog” in the URL
- Scrapes URLs on a hypothetical homepage
- Stops the program execution when the server returns HTTP 429
import datetime
import re

from wpull.application.hook import Actions
from wpull.application.plugin import WpullPlugin, PluginFunctions, hook, event
from wpull.protocol.abstract.request import BaseResponse
from wpull.pipeline.session import ItemSession


class MyExamplePlugin(WpullPlugin):
    def activate(self):
        super().activate()
        print('Hello world!')

    def deactivate(self):
        super().deactivate()
        print('Goodbye world!')

    @hook(PluginFunctions.accept_url)
    def my_accept_func(self, item_session: ItemSession, verdict: bool, reasons: dict) -> bool:
        return 'dog' not in item_session.request.url

    @event(PluginFunctions.get_urls)
    def my_get_urls(self, item_session: ItemSession):
        if item_session.request.url_info.path != '/':
            return

        matches = re.finditer(
            r'<div id="profile-(\w+)"', item_session.response.body.content
        )
        for match in matches:
            url = 'http://example.com/profile.php?username={}'.format(
                match.group(1))
            item_session.add_child_url(url)

    @hook(PluginFunctions.handle_response)
    def my_handle_response(self, item_session: ItemSession):
        if item_session.response.response_code == 429:
            return Actions.STOP
API¶
Wpull was designed as a command line program and most users do not need to read this section. However, you may be using the scripting hook interface or you may want to reuse a component.
Since Wpull is generally not a library, API backwards compatibility is provided on a best-effort basis; there is no guarantee on whether public or private functions will remain the same. This rule does not include the scripting hook interface which is designed for backwards compatibility.
Listed here are all documented classes and functions. Not all members are documented yet. Some members, such as the backported modules, are not documented here.
If the documentation is not sufficient, please take a look at the source code. Suggestions and improvements are welcomed.
Note
The API is not thread-safe. It is intended to be run asynchronously with Asyncio.
Many functions are also decorated with the asyncio.coroutine() decorator. For more information, see https://docs.python.org/3/library/asyncio.html.
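As a rough sketch of that pattern, using the Python 3.4-era coroutine style the library targets (the fetch_page coroutine here is a hypothetical stand-in for any coroutine-based Wpull component):

import asyncio


@asyncio.coroutine
def fetch_page():
    # Hypothetical placeholder for a coroutine-based Wpull API call.
    yield from asyncio.sleep(0.1)
    return 'done'


loop = asyncio.get_event_loop()
result = loop.run_until_complete(fetch_page())
print(result)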
wpull Package¶
application Module¶
application.app Module¶
Application main interface.
-
class
wpull.application.app.Application(pipeline_series: wpull.pipeline.pipeline.PipelineSeries)[source]¶ Bases:
wpull.application.hook.HookableMixinDefault non-interactive application user interface.
This class manages process signals and displaying warnings.
-
ERROR_CODE_MAP= OrderedDict([(<class 'wpull.errors.AuthenticationError'>, 6), (<class 'wpull.errors.ServerError'>, 8), (<class 'wpull.errors.ProtocolError'>, 7), (<class 'wpull.errors.SSLVerificationError'>, 5), (<class 'wpull.errors.DNSNotFound'>, 4), (<class 'wpull.errors.ConnectionRefused'>, 4), (<class 'wpull.errors.NetworkError'>, 4), (<class 'OSError'>, 3)])¶ Mapping of error types to exit status.
-
EXPECTED_EXCEPTIONS= (<class 'wpull.errors.ServerError'>, <class 'wpull.errors.ProtocolError'>, <class 'wpull.errors.SSLVerificationError'>, <class 'wpull.errors.DNSNotFound'>, <class 'wpull.errors.ConnectionRefused'>, <class 'wpull.errors.NetworkError'>, <class 'OSError'>, <class 'OSError'>, <class 'wpull.application.hook.HookStop'>, <class 'StopIteration'>, <class 'SystemExit'>, <class 'KeyboardInterrupt'>)¶ Exception classes that are not crashes.
-
Application.exit_code¶
-
application.builder Module¶
Application support.
-
class
wpull.application.builder.Builder(args, unit_test=False)[source]¶ Bases:
objectApplication builder.
Parameters: args – Options from argparse.ArgumentParser-
factory¶ Return the Factory.
Returns: A factory.Factory instance. Return type: Factory
-
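A rough sketch of building an application from parsed options. The build() and run_sync() calls are assumptions about the Builder and Application interfaces, not documented here; only AppArgumentParser and the Builder constructor are taken from this reference.

from wpull.application.builder import Builder
from wpull.application.options import AppArgumentParser

arg_parser = AppArgumentParser()
args = arg_parser.parse_args(['http://example.com/', '--quiet'])

builder = Builder(args)
application = builder.build()       # assumed: constructs the Application via the Factory
exit_code = application.run_sync()  # assumed: runs the pipelines and returns an exit status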
application.factory Module¶
Instance creation and management.
-
class
wpull.application.factory.Factory(class_map=None)[source]¶ Bases:
collections.abc.Mapping,objectAllows selection of classes and keeps track of instances.
This class behaves like a mapping. Keys are names of classes and values are instances.
-
class_map¶ A mapping of names to class types.
-
instance_map¶ A mapping of names to instances.
-
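A small sketch of how the factory can be used as a mapping. The new() method name is an assumption based on the class's role, and SQLiteURLTable merely stands in for any registered class:

from wpull.application.factory import Factory
from wpull.database.sqltable import SQLiteURLTable

factory = Factory({'URLTable': SQLiteURLTable})        # class_map: names to class types
url_table = factory.new('URLTable', path='crawl.db')   # assumed instantiation helper
print(factory['URLTable'] is url_table)                # mapping access returns the instance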
application.hook Module¶
Python and Lua scripting support.
See Plugin Scripting Hooks for an introduction.
-
class
wpull.application.hook.Actions[source]¶ Bases:
enum.EnumActions for handling responses and errors.
-
NORMAL¶ normal
Use Wpull’s original behavior.
-
RETRY¶ retry
Retry this item (as if an error has occurred).
-
FINISH¶ finish
Consider this item as done; don’t do any further processing on it.
-
-
exception
wpull.application.hook.HookAlreadyConnectedError[source]¶ Bases:
ValueErrorA callback is already connected to the hook.
-
exception
wpull.application.hook.HookDisconnected[source]¶ Bases:
RuntimeErrorNo callback is connected.
-
class
wpull.application.hook.HookDispatcher(event_dispatcher_transclusion: typing.Union=None)[source]¶ Bases:
collections.abc.MappingDynamic callback hook system.
application.main Module¶
application.options Module¶
Program options.
-
class
wpull.application.options.AppArgumentParser(*args, real_exit=True, **kwargs)[source]¶ Bases:
argparse.ArgumentParserAn Argument Parser that builds up the application options.
-
class
wpull.application.options.AppHelpFormatter(prog, indent_increment=2, max_help_position=24, width=None)[source]¶ Bases:
argparse.HelpFormatter
-
class
wpull.application.options.CommaChoiceListArgs[source]¶ Bases:
frozensetSpecialized frozenset.
This class overrides the __contains__ function to allow the use of the in operator for ArgumentParser's choices checking for comma separated lists. The function behaves differently only when the objects compared are CommaChoiceListArgs.
application.plugin Module¶
-
wpull.application.plugin.PluginClientFunctionInfo¶ alias of
_PluginClientFunctionInfo
application.plugins Module¶
application.plugins.arg_warning.plugin Module¶
application.plugins.debug_console.plugin Module¶
application.plugins.download_progress.plugin Module¶
application.plugins.server_response.plugin Module¶
application.tasks Module¶
application.tasks.conversion Module¶
-
class
wpull.application.tasks.conversion.QueuedFileSession(app_session: wpull.pipeline.app.AppSession, file_id: int, url_record: wpull.pipeline.item.URLRecord)[source]¶ Bases:
object
application.tasks.database Module¶
application.tasks.download Module¶
application.tasks.log Module¶
application.tasks.plugin Module¶
-
class
wpull.application.tasks.plugin.PluginLocator(directories, paths)[source]¶ Bases:
yapsy.IPluginLocator.IPluginLocator
application.tasks.resmon Module¶
application.tasks.rule Module¶
application.tasks.shutdown Module¶
-
class
wpull.application.tasks.shutdown.AppStopTask[source]¶ Bases:
wpull.pipeline.pipeline.ItemTask,wpull.application.hook.HookableMixin-
static
plugin_exit_status(app_session: wpull.pipeline.app.AppSession, exit_code: int) → int[source]¶ Return the program exit status code.
Exit codes are values from
errors.ExitStatus.Parameters: exit_code – The exit code Wpull wants to return. Returns: The exit code that Wpull will return. Return type: int
-
static
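For example, a plugin could override the exit status through this hook. A minimal sketch (the mapping of codes is illustrative only; exit code 8 is the server error status listed in errors.ExitStatus):

from wpull.application.plugin import WpullPlugin, PluginFunctions, hook


class ExitStatusPlugin(WpullPlugin):
    @hook(PluginFunctions.exit_status)
    def my_exit_status(self, app_session, exit_code):
        # Illustrative: treat server errors (exit code 8) as a successful run.
        return 0 if exit_code == 8 else exit_code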
application.tasks.sslcontext Module¶
application.tasks.stats Module¶
-
class
wpull.application.tasks.stats.StatsStopTask[source]¶ Bases:
wpull.pipeline.pipeline.ItemTask,wpull.application.hook.HookableMixin-
static
plugin_finishing_statistics(app_session: wpull.pipeline.app.AppSession, statistics: wpull.stats.Statistics)[source]¶ Callback containing final statistics.
Parameters: - start_time (float) – timestamp when the engine started
- end_time (float) – timestamp when the engine stopped
- num_urls (int) – number of URLs downloaded
- bytes_downloaded (int) – size of files downloaded in bytes
-
static
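A sketch of listening for this event; the files and size attributes on the Statistics object are assumptions for illustration:

from wpull.application.plugin import WpullPlugin, PluginFunctions, event


class StatsPlugin(WpullPlugin):
    @event(PluginFunctions.finishing_statistics)
    def my_finishing_statistics(self, app_session, statistics):
        # Assumed attributes: counters for downloaded files and bytes.
        print('Downloaded', statistics.files, 'files,', statistics.size, 'bytes')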
application.tasks.warc Module¶
body Module¶
Request and response payload.
-
class
wpull.body.Body(file=None, directory=None, hint='lone_body')[source]¶ Bases:
objectRepresents the document/payload of a request or response.
This class is a wrapper around a file object. Methods are forwarded to the underlying file object.
-
file¶ file
The file object.
Parameters: - file (file, optional) – Use the given file as the file object.
- directory (str) – If file is not given, use directory for a new temporary file.
- hint (str) – If file is not given, use hint as a filename infix.
-
cache Module¶
Caching.
-
class
wpull.cache.CacheItem(key, value, time_to_live=None, access_time=None)[source]¶ Bases:
objectInfo about an item in the cache.
Parameters: - key – The key
- value – The value
- time_to_live – The time in seconds of how long to keep the item
- access_time – The timestamp of the last use of the item
-
expire_time¶ When the item expires.
-
class
wpull.cache.FIFOCache(max_items=None, time_to_live=None)[source]¶ Bases:
wpull.cache.BaseCacheFirst in first out object cache.
Parameters: - max_items (int) – The maximum number of items to keep.
- time_to_live (float) – Discard items after time_to_live seconds.
Reusing a key to update a value will not affect the expire time of the item.
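A small usage sketch, assuming the cache supports dict-style assignment and membership tests:

from wpull.cache import FIFOCache

cache = FIFOCache(max_items=2, time_to_live=60.0)
cache['a'] = 1
cache['b'] = 2
cache['c'] = 3   # with max_items=2, the oldest entry ('a') should be discarded
print('a' in cache, 'c' in cache)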
-
class
wpull.cache.LRUCache(max_items=None, time_to_live=None)[source]¶ Bases:
wpull.cache.FIFOCacheLeast recently used object cache.
Parameters: - max_items – The maximum number of items to keep
- time_to_live – The time in seconds of how long to keep the item
-
wpull.cache.total_ordering(obj)¶
collections Module¶
Data structures.
-
class
wpull.collections.FrozenDict(orig_dict)[source]¶ Bases:
collections.abc.Mapping,collections.abc.HashableImmutable mapping wrapper.
-
hash_cache¶
-
orig_dict¶
-
-
class
wpull.collections.LinkedList[source]¶ Bases:
objectDoubly linked list.
-
map¶ dict
A mapping of values to nodes.
-
head¶ -
The first node.
-
tail¶ -
The last node.
-
-
class
wpull.collections.LinkedListNode(value, head=None, tail=None)[source]¶ Bases:
objectA node in a
LinkedList.-
value¶ Any value.
-
head¶ LinkedListNode
The node in front.
-
tail¶ LinkedListNode
The node in back.
-
head
-
tail
-
value
-
converter Module¶
Document content post-processing.
-
class
wpull.converter.BaseDocumentConverter[source]¶ Bases:
objectBase class for classes that convert links within a document.
-
class
wpull.converter.BatchDocumentConverter(html_parser, element_walker, url_table, backup=False)[source]¶ Bases:
objectConvert all documents in URL table.
Parameters: - url_table – An instance of
database.URLTable. - backup (bool) – Whether back up files are created.
- url_table – An instance of
-
class
wpull.converter.CSSConverter(url_table)[source]¶ Bases:
wpull.scraper.css.CSSScraper,wpull.converter.BaseDocumentConverterCSS converter.
-
class
wpull.converter.HTMLConverter(html_parser, element_walker, url_table)[source]¶ Bases:
wpull.scraper.html.HTMLScraper,wpull.converter.BaseDocumentConverterHTML converter.
cookie Module¶
HTTP Cookies.
Bases:
http.cookiejar.FileCookieJarMozillaCookieJar that is compatible with Wget/Curl.
It ignores file header checks and supports session cookies.
Bases:
http.cookiejar.DefaultCookiePolicyCookie policy that limits the content and length of the cookie.
Parameters: cookie_jar – The CookieJar instance. This policy class is not designed to be shared between CookieJar instances.
Return approximate length of all cookie key-values for a domain.
Return the number of cookies for the given domain.
cookiewrapper Module¶
Wrappers that wrap instances to Python standard library.
Bases:
objectWraps a CookieJar.
Parameters: - cookie_jar – An instance of
http.cookiejar.CookieJar. - save_filename (str, optional) – A filename to save the cookies.
- keep_session_cookies (bool) – If True, session cookies are kept when saving to file.
Wrapped
add_cookie_header.Parameters: - request – An instance of
http.request.Request. - referrer_host (str) – A hostname or IP address of the referrer URL.
- request – An instance of
Save the cookie jar if needed.
Return the wrapped Cookie Jar.
Wrapped
extract_cookies.Parameters: - response – An instance of
http.request.Response. - request – An instance of
http.request.Request. - referrer_host (str) – A hostname or IP address of the referrer URL.
- response – An instance of
- cookie_jar – An instance of
Bases:
objectWraps a HTTP Response.
Parameters: response – An instance of http.request.ResponseReturn the header fields as a Message:
Returns: An instance of email.message.Message. If Python 2, returns an instance ofmimetools.Message.Return type: Message
Convert a HTTP request.
Parameters: - request – An instance of
http.request.Request. - referrer_host (str) – The referrering hostname or IP address.
Returns: An instance of
urllib.request.RequestReturn type: - request – An instance of
database Module¶
Storage for tracking URLs.
database.base Module¶
Base table class.
-
wpull.database.base.AddURLInfo¶ alias of
_AddURLInfo
-
class
wpull.database.base.BaseURLTable[source]¶ Bases:
objectURL table.
-
add_many(new_urls: typing.Iterator) → typing.Iterator[source]¶ Add the URLs to the table.
Parameters: new_urls – URLs to be added. Returns: The URLs added. Useful for tracking duplicates.
-
add_one(url: str, url_properties: typing.Union=None, url_data: typing.Union=None)[source]¶ Add a single URL to the table.
Parameters: - url – The URL to be added
- url_properties – Additional values to be saved
- url_data – Additional data to be saved
-
add_visits(visits)[source]¶ Add visited URLs from CDX file.
Parameters: visits (iterable) – An iterable of items. Each item is a tuple containing a URL, the WARC ID, and the payload digest.
-
check_in(url: str, new_status: wpull.pipeline.item.Status, increment_try_count: bool=True, url_result: typing.Union=None)[source]¶ Update record for processed URL.
Parameters: - url – The URL.
- new_status – Update the item status to new_status.
- increment_try_count – Whether to increment the try counter for the URL.
- url_result – Additional values.
-
check_out(filter_status: wpull.pipeline.item.Status, filter_level: typing.Union=None) → wpull.pipeline.item.URLRecord[source]¶ Find a URL, mark it in progress, and return it.
Parameters: - filter_status – Gets first item with given status.
- filter_level – Gets item with filter_level or lower.
Raises:
-
get_one(url: str) → wpull.pipeline.item.URLRecord[source]¶ Return a URLRecord for the URL.
Raises: NotFound
-
-
exception
wpull.database.base.NotFound[source]¶ Bases:
wpull.database.base.DatabaseErrorItem not found in the table.
database.sqlmodel Module¶
Database SQLAlchemy model.
-
wpull.database.sqlmodel.DBBase¶ alias of
Base
-
class
wpull.database.sqlmodel.QueuedURL(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.Base-
filename¶ Local filename of the item.
-
id¶
-
inline_level¶ Depth of the page requisite object. 0 is the object, 1 is the object’s dependency, etc.
-
level¶ Recursive depth of the item. 0 is root, 1 is child of root, etc.
-
link_type¶ Expected content type of extracted link.
-
parent_url¶ A descriptor that presents a read/write view of an object attribute.
-
parent_url_string¶
-
parent_url_string_id¶ Optional referral URL
-
post_data¶ Additional percent-encoded data for POST.
-
priority¶ Priority of item.
-
root_url¶ A descriptor that presents a read/write view of an object attribute.
-
root_url_string¶
-
root_url_string_id¶ Optional root URL
-
status¶ Status of the completion of the item.
-
status_code¶ HTTP status code or FTP reply code.
-
try_count¶ Number of attempts made in order to process the item.
-
url¶ A descriptor that presents a read/write view of an object attribute.
-
url_string¶
-
url_string_id¶ Target URL to fetch
-
-
class
wpull.database.sqlmodel.URLString(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.BaseTable containing the URL strings.
The URL references this table.
-
id¶
-
url¶
-
-
class
wpull.database.sqlmodel.WARCVisit(**kwargs)[source]¶ Bases:
sqlalchemy.ext.declarative.api.BaseStandalone table for
--cdx-dedup feature.
-
payload_digest¶
-
url¶
-
warc_id¶
-
database.sqltable Module¶
SQLAlchemy table implementations.
-
class
wpull.database.sqltable.SQLiteURLTable(path=':memory:')[source]¶ Bases:
wpull.database.sqltable.BaseSQLURLTableURL table with SQLite storage.
Parameters: path – A SQLite filename
-
class
wpull.database.sqltable.GenericSQLURLTable(url)[source]¶ Bases:
wpull.database.sqltable.BaseSQLURLTableURL table using SQLAlchemy without any customizations.
Parameters: url – A SQLAlchemy database URL.
-
wpull.database.sqltable.URLTable¶ The default URL table implementation.
alias of
SQLiteURLTable
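A minimal sketch of driving a URL table directly, using the BaseURLTable methods documented above. Status.todo, Status.done, and the url attribute on the returned record are assumed names for illustration:

from wpull.database.sqltable import SQLiteURLTable
from wpull.pipeline.item import Status

url_table = SQLiteURLTable(path='crawl.db')
url_table.add_one('http://example.com/')      # queue a URL
record = url_table.check_out(Status.todo)     # fetch it and mark it in progress
print(record.url)
url_table.check_in(record.url, Status.done)   # mark it as completed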
database.wrap Module¶
URL table wrappers.
-
class
wpull.database.wrap.URLTableHookWrapper(url_table)[source]¶ Bases:
wpull.database.base.BaseURLTable,wpull.application.hook.HookableMixinURL table wrapper with scripting hooks.
Parameters: url_table – URL table. -
url_table¶ URL table.
-
static
dequeued_url(url_info: wpull.url.URLInfo, record_info: wpull.pipeline.item.URLRecord)[source]¶ Callback fired after an URL was retrieved from the queue.
-
debug Module¶
Debugging utilities.
-
class
wpull.debug.DebugConsoleHandler(application, request, **kwargs)[source]¶ Bases:
tornado.web.RequestHandler-
TEMPLATE= '<html>\n <style>\n #commandbox {{\n width: 100%;\n }}\n </style>\n <body>\n <p>Welcome to DEBUG CONSOLE!</p>\n <p><tt>Builder()</tt> instance at <tt>wpull_builder</tt>.</p>\n <form method="post">\n <input id="commandbox" name="command" value="{command}">\n <input type="submit" value="Execute">\n </form>\n <pre>{output}</pre>\n </body>\n </html>\n '¶
-
decompression Module¶
Streaming decompressors.
-
class
wpull.decompression.DeflateDecompressor[source]¶ Bases:
wpull.decompression.SimpleGzipDecompressorzlib decompressor with raw deflate detection.
This class doesn’t do anything special. It only tries regular zlib and then tries raw deflate on the first decompress.
-
class
wpull.decompression.GzipDecompressor[source]¶ Bases:
wpull.decompression.SimpleGzipDecompressorgzip decompressor with gzip header detection.
This class checks if the stream starts with the 2 byte gzip magic number. If it is not present, it returns the bytes unchanged.
-
class
wpull.decompression.SimpleGzipDecompressor[source]¶ Bases:
objectStreaming gzip decompressor.
The interface is like that of zlib.decompressobj (without some of the optional arguments), but it understands gzip headers and checksums.
document Module¶
Document handling.
document.base Module¶
Document bases.
-
class
wpull.document.base.BaseDocumentDetector[source]¶ Bases:
objectBase class for classes that detect document types.
-
classmethod
is_file(file)[source]¶ Return whether the reader is likely able to read the file.
Parameters: file – A file object containing the document. Returns: bool
-
classmethod
is_request(request)[source]¶ Return whether the request is likely supported.
Parameters: request ( http.request.Request) – An HTTP request.Returns: bool
-
classmethod
is_response(response)[source]¶ Return whether the response is likely able to be read.
Parameters: response ( http.request.Response) – An HTTP response.Returns: bool
-
classmethod
is_supported(file=None, request=None, response=None, url_info=None)[source]¶ Given the hints, return whether the document is supported.
Parameters: - file – A file object containing the document.
- request (
http.request.Request) – An HTTP request. - response (
http.request.Response) – An HTTP response. - url_info (
url.URLInfo) – A URLInfo.
Returns: If True, the reader should be able to read it.
Return type: bool
-
classmethod
is_url(url_info)[source]¶ Return whether the URL is likely to be supported.
Parameters: url_info ( url.URLInfo) – A URLInfo.Returns: bool
-
classmethod
-
class
wpull.document.base.BaseExtractiveReader[source]¶ Bases:
objectBase class for document readers that can only extract links.
-
class
wpull.document.base.BaseHTMLReader[source]¶ Bases:
objectBase class for document readers for handling SGML-like documents.
-
iter_elements(file, encoding=None)[source]¶ Return an iterator of elements found in the document.
Parameters: - file – A file object containing the document.
- encoding (str) – The encoding of the document.
Returns: Each item is an element from
document.htmlparse.elementReturn type: iterator
-
-
class
wpull.document.base.BaseTextStreamReader[source]¶ Bases:
objectBase class for document readers that filters link and non-link text.
-
iter_links(file, encoding=None, context=False)[source]¶ Return the links.
This function is a convenience function for calling
iter_text()and returning only the links.
-
iter_text(file, encoding=None)[source]¶ Return the file text and links.
Parameters: - file – A file object containing the document.
- encoding (str) – The encoding of the document.
Returns: Each item is a tuple:
- str: The text
- bool (or truthy value): Whether the text is a likely a link. If truthy value may be provided containing additional context of the link.
Return type: iterator
The links returned are raw text and will require further processing.
-
-
wpull.document.base.VeryFalse= <wpull.document.base.VeryFalseType object>¶ Document is not definitely supported.
document.css Module¶
Stylesheet reader.
-
class
wpull.document.css.CSSReader[source]¶ Bases:
wpull.document.base.BaseDocumentDetector,wpull.document.base.BaseTextStreamReaderCascading Stylesheet Document Reader.
-
BUFFER_SIZE= 1048576¶
-
IMPORT_URL_PATTERN= '@import\\s*(?:url\\()?[\'"]?([^\\s\'")]{1,500}).*?;'¶
-
STREAM_REWIND= 4096¶
-
URL_PATTERN= 'url\\(\\s*([\'"]?)(.{1,500}?)(?:\\1)\\s*\\)'¶
-
URL_REGEX= re.compile('url\\(\\s*([\'"]?)(.{1,500}?)(?:\\1)\\s*\\)|@import\\s*(?:url\\()?[\'"]?([^\\s\'")]{1,500}).*?;')¶
-
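A short sketch of extracting links from a stylesheet with the iter_links() method inherited from BaseTextStreamReader (a rough illustration; the exact items yielded may include extra context):

import io
from wpull.document.css import CSSReader

reader = CSSReader()
css_file = io.BytesIO(b'body { background: url("bg.png"); } @import "print.css";')

for link in reader.iter_links(css_file, encoding='utf-8'):
    print(link)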
document.html Module¶
HTML document readers.
-
wpull.document.html.COMMENT= <object object>¶ Comment element
-
class
wpull.document.html.HTMLLightParserTarget(callback, text_elements=frozenset({'style', 'script', 'link', 'url', 'icon'}))[source]¶ Bases:
objectAn HTML parser target for partial elements.
Parameters:
- callback – A callback function. The function should accept these arguments: 1. tag (str): the tag name of the element; 2. attrib (dict): the attributes of the element; 3. text (str, None): the text of the element.
- text_elements – A frozenset of element tag names whose text should be kept track of.
-
class
wpull.document.html.HTMLParserTarget(callback)[source]¶ Bases:
objectAn HTML parser target.
Parameters: callback – A callback function. The function should accept these arguments: 1. tag (str): the tag name of the element; 2. attrib (dict): the attributes of the element; 3. text (str, None): the text of the element; 4. tail (str, None): the text after the element; 5. end (bool): whether the tag is an end tag.
-
class
wpull.document.html.HTMLReadElement(tag, attrib, text, tail, end)[source]¶ Bases:
objectResults from
HTMLReader.read_links().-
tag¶ str
The element tag name.
-
attrib¶ dict
The element attributes.
-
text¶ str, None
The element text.
-
tail¶ str, None
The text after the element.
-
end¶ bool
Whether the tag is an end tag.
-
attrib
-
end
-
tag
-
tail
-
text
-
-
class
wpull.document.html.HTMLReader(html_parser)[source]¶ Bases:
wpull.document.base.BaseDocumentDetector,wpull.document.base.BaseHTMLReaderHTML document reader.
Parameters: html_parser ( document.htmlparse.BaseParser) – An HTML parser.
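A small sketch of reading elements out of an HTML document using the lxml-based parser documented below (a rough illustration; the html5lib parser class could be used instead):

import io
from wpull.document.html import HTMLReader
from wpull.document.htmlparse.lxml_ import HTMLParser

reader = HTMLReader(HTMLParser())
html_file = io.BytesIO(b'<html><body><a href="/about">About</a></body></html>')

for element in reader.iter_elements(html_file, encoding='utf-8'):
    print(element)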
document.htmlparse Module¶
HTML parsing.
document.htmlparse.base Module¶
document.htmlparse.element Module¶
HTML tree things.
-
wpull.document.htmlparse.element.Comment¶ A comment.
-
wpull.document.htmlparse.element.text¶ str
The comment text.
alias of
CommentType-
-
wpull.document.htmlparse.element.Doctype¶ A Doctype.
-
wpull.document.htmlparse.element.text str
The Doctype text.
alias of
DoctypeType-
-
wpull.document.htmlparse.element.Element¶ An HTML element.
- Attributes
- tag (str): The tag name of the element. attrib (dict): The attributes of the element. text (str, None): The text of the element. tail (str, None): The text after the element. end (bool): Whether the tag is an end tag.
alias of
ElementType
document.htmlparse.html5lib_ Module¶
Parsing using html5lib python.
document.htmlparse.lxml_ Module¶
Parsing using lxml and libxml2.
-
class
wpull.document.htmlparse.lxml_.HTMLParser[source]¶ Bases:
wpull.document.htmlparse.base.BaseParserHTML document parser.
This reader uses lxml as the parser.
-
BUFFER_SIZE= 131072¶
-
classmethod
detect_parser_type(file, encoding=None)[source]¶ Get the suitable parser type for the document.
Returns: str
-
classmethod
parse_doctype(file, encoding=None)[source]¶ Get the doctype from the document.
Returns: str, None
-
parse_lxml(file, encoding=None, target_class=<class 'wpull.document.htmlparse.lxml_.HTMLParserTarget'>, parser_type='html')[source]¶ Return an iterator of elements found in the document.
Parameters: - file – A file object containing the document.
- encoding (str) – The encoding of the document.
- target_class – A class to be used for target parsing.
- parser_type (str) – The type of parser to use. Accepted values:
html,xhtml,xml.
Returns: Each item is an element from
document.htmlparse.elementReturn type: iterator
-
parser_error¶
-
-
class
wpull.document.htmlparse.lxml_.HTMLParserTarget(callback)[source]¶ Bases:
objectAn HTML parser target.
Parameters: callback – A callback function. The function should accept one argument from document.htmlparse.element.
document.javascript Module¶
-
class
wpull.document.javascript.JavaScriptReader[source]¶ Bases:
wpull.document.base.BaseDocumentDetector,wpull.document.base.BaseTextStreamReaderJavaScript Document Reader.
-
BUFFER_SIZE= 1048576¶
-
STREAM_REWIND= 4096¶
-
URL_PATTERN= '(\\\\{0,8}[\'"])(https?://[^\'"]{1,500}|[^\\s\'"]{1,500})(?:\\1)'¶
-
URL_REGEX= re.compile('(\\\\{0,8}[\'"])(https?://[^\'"]{1,500}|[^\\s\'"]{1,500})(?:\\1)')¶
-
document.sitemap Module¶
Sitemap.xml
-
class
wpull.document.sitemap.SitemapReader(html_parser)[source]¶ Bases:
wpull.document.base.BaseDocumentDetector,wpull.document.base.BaseExtractiveReaderSitemap XML reader.
-
MAX_ROBOTS_FILE_SIZE= 4096¶
-
document.util Module¶
Misc functions.
-
wpull.document.util.detect_response_encoding(response, is_html=False, peek=131072)[source]¶ Return the likely encoding of the response document.
Parameters: - response (Response) – An instance of
http.Response. - is_html (bool) – See
util.detect_encoding(). - peek (int) – The maximum number of bytes of the document to be analyzed.
Returns: The codec name.
Return type: str,None- response (Response) – An instance of
document.xml Module¶
XML document.
driver Module¶
Interprocess communicators.
driver.phantomjs Module¶
-
class
wpull.driver.phantomjs.PhantomJSDriver(exe_path='phantomjs', extra_args=None, params=None)[source]¶ Bases:
wpull.driver.process.ProcessPhantomJS processing.
Parameters: - exe_path (str) – Path of the PhantomJS executable.
- extra_args (list) – Additional arguments for PhantomJS. Most likely, you’ll want to pass proxy settings for capturing traffic.
- params (
PhantomJSDriverParams) – Parameters for controlling the processing pipeline.
This class launches PhantomJS that scrolls and saves snapshots. It can only be used once per URL.
-
wpull.driver.phantomjs.PhantomJSDriverParams¶ PhantomJS Driver parameters
-
wpull.driver.phantomjs.url¶ str
URL of page to fetch.
-
wpull.driver.phantomjs.snapshot_type¶ list
List of filenames. Accepted extensions are html, pdf, png, gif.
-
wpull.driver.phantomjs.wait_time¶ float
Time between page scrolls.
-
wpull.driver.phantomjs.num_scrolls¶ int
Maximum number of scrolls.
-
wpull.driver.phantomjs.smart_scroll¶ bool
Whether to stop scrolling if number of requests & responses do not change.
-
wpull.driver.phantomjs.snapshot¶ bool
Whether to take snapshot files.
-
wpull.driver.phantomjs.viewport_size¶ tuple
Width and height of the page viewport.
-
wpull.driver.phantomjs.paper_size¶ tuple
Width and height of the paper size.
-
wpull.driver.phantomjs.event_log_filename¶ str
Path to save page events.
-
wpull.driver.phantomjs.action_log_filename¶ str
Path to save page action manipulation events.
-
wpull.driver.phantomjs.custom_headers¶ dict
Custom HTTP request headers.
-
wpull.driver.phantomjs.page_settings¶ dict
Page settings.
alias of
PhantomJSDriverParamsType-
driver.process Module¶
RPC processes.
errors Module¶
Exceptions.
-
exception
wpull.errors.AuthenticationError[source]¶ Bases:
wpull.errors.ServerErrorUsername or password error.
-
exception
wpull.errors.ConnectionRefused[source]¶ Bases:
wpull.errors.NetworkErrorServer was online, but nothing was being served.
-
exception
wpull.errors.DNSNotFound[source]¶ Bases:
wpull.errors.NetworkErrorServer’s IP address could not be located.
-
wpull.errors.ERROR_PRIORITIES= (<class 'wpull.errors.ServerError'>, <class 'wpull.errors.ProtocolError'>, <class 'wpull.errors.SSLVerificationError'>, <class 'wpull.errors.AuthenticationError'>, <class 'wpull.errors.DNSNotFound'>, <class 'wpull.errors.ConnectionRefused'>, <class 'wpull.errors.NetworkError'>, <class 'OSError'>, <class 'OSError'>, <class 'ValueError'>)¶ List of error classes by least severe to most severe.
-
class
wpull.errors.ExitStatus[source]¶ Bases:
objectProgram exit status codes.
-
generic_error¶ 1
An unclassified serious or fatal error occurred.
-
parser_error¶ 2
A local document or configuration file could not be parsed.
-
file_io_error¶ 3
A problem with reading/writing a file occurred.
-
network_failure¶ 4
A problem with the network occurred such as a DNS resolver error or a connection was refused.
-
ssl_verification_error¶ 5
A server’s SSL/TLS certificate was invalid.
-
authentication_failure¶ 6
A problem with a username or password.
-
protocol_error¶ 7
A problem with communicating with a server occurred.
-
server_error¶ 8
The server had problems fulfilling our requests.
-
authentication_failure= 6
-
file_io_error= 3
-
generic_error= 1
-
network_failure= 4
-
parser_error= 2
-
protocol_error= 7
-
server_error= 8
-
ssl_verification_error= 5
-
-
exception
wpull.errors.NetworkTimedOut[source]¶ Bases:
wpull.errors.NetworkErrorConnection read/write timed out.
-
wpull.errors.SSLVerficationError¶ alias of
SSLVerificationError
namevalue Module¶
Key-value pairs.
-
class
wpull.namevalue.NameValueRecord(normalize_overrides=None, encoding='utf-8', wrap_width=None)[source]¶ Bases:
collections.abc.MutableMappingAn ordered mapping of name-value pairs.
Duplicated names are accepted.
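A short sketch of the mapping interface; plain item access comes from MutableMapping, while the add() and get_list() helpers for duplicate names are assumptions inferred from the note above and may differ in the actual API.
from wpull.namevalue import NameValueRecord

record = NameValueRecord()
record['Content-Type'] = 'text/html'   # ordinary mapping assignment
record.add('Set-Cookie', 'a=1')        # assumed helper for storing duplicate names
record.add('Set-Cookie', 'b=2')
print(record['Content-Type'])
print(record.get_list('Set-Cookie'))   # assumed accessor returning every value for a name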
-
wpull.namevalue.guess_line_ending(string)[source]¶ Return the most likely line delimiter from the string.
-
wpull.namevalue.normalize_name(name, overrides=None)[source]¶ Normalize the key name to title case.
For example,
normalize_name('content-id') will become Content-Id.
Parameters: - name (str) – The name to normalize.
- overrides (set, sequence) – A set or sequence containing keys that
should be cased to themselves. For example, passing
set('WARC-Type') will normalize any key named “warc-type” to WARC-Type instead of the default Warc-Type.
Returns: str
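For example, the calls below show the documented title-casing and an override; passing the override as a one-element set literal is this sketch's reading of the overrides argument.
from wpull.namevalue import normalize_name

print(normalize_name('content-id'))                          # Content-Id
print(normalize_name('warc-type', overrides={'WARC-Type'}))  # WARC-Type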
network Module¶
network.bandwidth Module¶
Network bandwidth.
-
class
wpull.network.bandwidth.BandwidthLimiter(rate_limit)[source]¶ Bases:
wpull.network.bandwidth.BandwidthMeterBandwidth rate limit calculator.
-
class
wpull.network.bandwidth.BandwidthMeter(sample_size=20, sample_min_time=0.15, stall_time=5.0)[source]¶ Bases:
objectCalculates the speed of data transfer.
Parameters: - sample_size (int) – The number of samples for measuring the speed.
- sample_min_time (float) – The minimum duration between samples in seconds.
- stall_time (float) – The time in seconds to consider no traffic to be connection stalled.
-
bytes_transferred¶ Return the number of bytes transferred
Returns: int
-
feed(data_len, feed_time=None)[source]¶ Update the bandwidth meter.
Parameters: - data_len (int) – The number of bytes transferred since the last call to feed().
- feed_time (float) – Current time.
-
num_samples¶ Return the number of samples collected.
-
speed()[source]¶ Return the current transfer speed.
Returns: The speed in bytes per second. Return type: int
-
stalled¶ Return whether the connection is stalled.
Returns: bool
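A minimal sketch of feeding transfer samples into the meter; the byte counts and timestamps are illustrative.
import time
from wpull.network.bandwidth import BandwidthMeter

meter = BandwidthMeter()
now = time.time()
meter.feed(16384, feed_time=now + 0.2)   # 16 KiB arrived 0.2 s later
meter.feed(32768, feed_time=now + 0.5)   # another 32 KiB at 0.5 s
print('speed:', meter.speed(), 'bytes/s')
print('stalled:', meter.stalled)
print('total:', meter.bytes_transferred, 'bytes')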
network.connection Module¶
Network connections.
-
class
wpull.network.connection.BaseConnection(address: tuple, hostname: typing.Union=None, timeout: typing.Union=None, connect_timeout: typing.Union=None, bind_host: typing.Union=None, sock: typing.Union=None)[source]¶ Bases:
objectBase network stream.
Parameters: - address – 2-item tuple containing the IP address and port or 4-item for IPv6.
- hostname – Hostname of the address (for SSL).
- timeout – Time in seconds before a read/write operation times out.
- connect_timeout – Time in seconds before a connect operation times out.
- bind_host – Host name for binding the socket interface.
- sock – Use given socket. The socket must already be connected.
-
reader¶ Stream Reader instance.
-
writer¶ Stream Writer instance.
-
address¶ 2-item tuple containing the IP address.
-
host¶ Host name.
-
port¶ Port number.
-
address
-
host
-
hostname¶
-
port
-
class
wpull.network.connection.CloseTimer(timeout, connection)[source]¶ Bases:
objectPeriodic timer to close connections if stalled.
-
class
wpull.network.connection.Connection(*args, bandwidth_limiter=None, **kwargs)[source]¶ Bases:
wpull.network.connection.BaseConnectionNetwork stream.
Parameters: bandwidth_limiter (bandwidth.BandwidthLimiter) – Bandwidth limiter for connection speed limiting. -
key¶ Value used by the ConnectionPool for its host pool map. Internal use only.
-
wrapped_connection¶ A wrapped connection for ConnectionPool. Internal use only.
-
is_ssl¶ bool
Whether connection is SSL.
-
proxied¶ bool
Whether the connection is to an HTTP proxy.
-
tunneled¶ bool
Whether the connection has been tunneled with the
CONNECT request.
-
is_ssl
-
proxied
-
start_tls(ssl_context: typing.Union=True) → 'SSLConnection'[source]¶ Start client TLS on this connection and return SSLConnection.
Coroutine
-
tunneled
-
-
class
wpull.network.connection.ConnectionState[source]¶ Bases:
enum.EnumState of a connection
-
ready¶ Connection is ready to be used
-
created¶ connect has been called successfully
-
dead¶ Connection is closed
-
network.dns Module¶
DNS resolution.
-
wpull.network.dns.AddressInfo¶ Socket address.
alias of
_AddressInfo
-
class
wpull.network.dns.ResolveResult(address_infos: typing.List, dns_infos: typing.Union=None)[source]¶ Bases:
objectDNS resolution information.
-
addresses¶ The socket addresses.
-
dns_infos¶ The DNS resource records.
-
first_ipv4¶ The first IPv4 address.
-
first_ipv6¶ The first IPV6 address.
-
-
class
wpull.network.dns.Resolver(family: wpull.network.dns.IPFamilyPreference=<IPFamilyPreference.any: 'any'>, timeout: typing.Union=None, bind_address: typing.Union=None, cache: typing.Union=None, rotate: bool=False)[source]¶ Bases:
wpull.application.hook.HookableMixinAsynchronous resolver with cache and timeout.
Parameters: - family – IPv4 or IPv6 preference.
- timeout – A time in seconds used for timing-out requests. If not specified, this class relies on the underlying libraries.
- bind_address – An IP address to bind DNS requests if possible.
- cache – Cache to store results of any query.
- rotate – If the result is cached, rotate the results; otherwise, shuffle the results.
-
resolve(host: str) → wpull.network.dns.ResolveResult[source]¶ Resolve hostname.
Parameters: host – Hostname.
Returns: Resolved IP addresses.
Raises: - DNSNotFound if the hostname could not be resolved or
- NetworkError if there was an error connecting to DNS servers.
Coroutine.
-
static
resolve_dns(host: str) → str[source]¶ Resolve the hostname to an IP address.
Parameters: host – The hostname. This callback is to override the DNS lookup.
It is useful when the server is no longer available to the public. Typically, large infrastructures will change the DNS settings to make clients no longer hit the front-ends, but rather go towards a static HTTP server with a “We’ve been acqui-hired!” page. In these cases, the original servers may still be online.
Returns: None to use the original behavior or a string containing an IP address or an alternate hostname. Return type: str, None
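The sketch below overrides the lookup by subclassing Resolver, for example to pin a dead hostname to an archived server's address; the subclassing approach and the pinned address are assumptions, since the hook may instead be attached through Wpull's plugin system.
from wpull.network.dns import Resolver

class PinnedResolver(Resolver):
    @staticmethod
    def resolve_dns(host):
        if host == 'www.example.com':   # hypothetical decommissioned front-end
            return '192.0.2.10'         # pinned replacement address (documentation range)
        return None                     # None keeps the default DNS behaviour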
network.pool Module¶
-
class
wpull.network.pool.ConnectionPool(max_host_count: int=6, resolver: typing.Union=None, connection_factory: typing.Union=None, ssl_connection_factory: typing.Union=None, max_count: int=100)[source]¶ Bases:
objectConnection pool.
Parameters: - max_host_count – Number of connections per host.
- resolver – DNS resolver.
- connection_factory – A function that accepts address and hostname arguments and returns a Connection instance.
- ssl_connection_factory – A function that returns an SSLConnection instance. See connection_factory.
- max_count – Limit on the number of connections.
-
acquire(host: str, port: int, use_ssl: bool=False, host_key: typing.Any=None) → wpull.network.connection.Connection[source]¶ Return an available connection.
Parameters: - host – A hostname or IP address.
- port – Port number.
- use_ssl – Whether to return a SSL connection.
- host_key – If provided, it overrides the key used for per-host connection pooling. This is useful for proxies for example.
Coroutine.
-
clean(force: bool=False)[source]¶ Clean all closed connections.
Parameters: force – Clean connected and idle connections too. Coroutine.
-
close()[source]¶ Close all the connections and clean up.
This instance will not be usable after calling this method.
-
host_pools¶
-
no_wait_release(connection: wpull.network.connection.Connection)[source]¶ Synchronous version of
release().
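A minimal sketch of borrowing a connection from the pool inside a coroutine; driving the documented coroutines with async/await, the connect() call, and the example host are assumptions.
import asyncio
from wpull.network.pool import ConnectionPool

async def use_pool():
    pool = ConnectionPool(max_host_count=2)
    connection = await pool.acquire('example.com', 80)
    try:
        await connection.connect()        # assumed to be a no-op if already connected
        # ... read/write via connection.reader / connection.writer ...
    finally:
        pool.no_wait_release(connection)  # hand the connection back to the pool
    pool.close()

asyncio.get_event_loop().run_until_complete(use_pool())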
-
class
wpull.network.pool.HappyEyeballsConnection(address, connection_factory, resolver, happy_eyeballs_table, is_ssl=False)[source]¶ Bases:
objectWrapper for happy eyeballs connection.
-
class
wpull.network.pool.HostPool(connection_factory: typing.Callable, max_connections: int=6)[source]¶ Bases:
objectConnection pool for a host.
-
ready¶ Queue
Connections not in use.
-
busy¶ set
Connections in use.
-
acquire() → wpull.network.connection.Connection[source]¶ Register and return a connection.
Coroutine.
-
clean(force: bool=False)[source]¶ Clean closed connections.
Parameters: force – Clean connected and idle connections too. Coroutine.
-
observer Module¶
Observer.
path Module¶
File names and paths.
-
class
wpull.path.PathNamer(root, index='index.html', use_dir=False, cut=None, protocol=False, hostname=False, os_type='unix', no_control=True, ascii_only=True, case=None, max_filename_length=None)[source]¶ Bases:
wpull.path.BasePathNamerPath namer that creates a directory hierarchy based on the URL.
Parameters: - root (str) – The base path.
- index (str) – The filename to use when the URL path does not indicate one.
- use_dir (bool) – Include directories based on the URL path.
- cut (int) – Number of leading directories to cut from the file path.
- protocol (bool) – Include the URL scheme in the directory structure.
- hostname (bool) – Include the hostname in the directory structure.
- safe_filename_args (dict) – Keyword arguments for safe_filename.
See also:
url_to_filename(),url_to_dir_path(),safe_filename().
-
class
wpull.path.PercentEncoder(unix=False, control=False, windows=False, ascii_=False)[source]¶ Bases:
collections.defaultdictPercent encoder.
-
wpull.path.anti_clobber_dir_path(dir_path, suffix='.d')[source]¶ Return a directory path free of filenames.
Parameters: - dir_path (str) – A directory path.
- suffix (str) – The suffix to append to the part of the path that is a file.
Returns: str
-
wpull.path.safe_filename(filename, os_type='unix', no_control=True, ascii_only=True, case=None, encoding='utf8', max_length=None)[source]¶ Return a safe filename or path part.
Parameters: - filename (str) – The filename or path component.
- os_type (str) – If
unix, escape the slash. Ifwindows, escape extra Windows characters. - no_control (bool) – If True, escape control characters.
- ascii_only (bool) – If True, escape non-ASCII characters.
- case (str) – If
lower, lowercase the string. Ifupper, uppercase the string. - encoding (str) – The character encoding.
- max_length (int) – The maximum length of the filename.
This function assumes that filename has not already been percent-encoded.
Returns: str
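For example, the calls below sanitize a few troublesome names; the exact escaped output depends on the implementation, so the comments only describe the intent.
from wpull.path import safe_filename

print(safe_filename('résumé.html'))                    # non-ASCII characters are escaped
print(safe_filename('a:b?.html', os_type='windows'))   # extra Windows characters are escaped
print(safe_filename('Index.HTML', case='lower', max_length=8))  # lowercased and shortened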
-
wpull.path.url_to_dir_parts(url, include_protocol=False, include_hostname=False, alt_char=False)[source]¶ Return a list of directory parts from a URL.
Parameters: - url (str) – The URL.
- include_protocol (bool) – If True, the scheme from the URL will be included.
- include_hostname (bool) – If True, the hostname from the URL will be included.
- alt_char (bool) – If True, the character for the port delimiter
will be
+ instead of :.
This function does not include the filename and the paths are not sanitized.
Returns: list
-
wpull.path.url_to_filename(url, index='index.html', alt_char=False)[source]¶ Return a filename from a URL.
Parameters: - url (str) – The URL.
- index (str) – If a filename could not be derived from the URL path,
use index instead. For example,
/images/ will return index.html. - alt_char (bool) – If True, the character for the query delimiter
will be
@ instead of ?.
This function does not include the directories and does not sanitize the filename.
Returns: str
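A short sketch combining the two helpers to derive an on-disk location; the example URL and the values in the comments follow the documented behaviour.
import os
from wpull.path import url_to_dir_parts, url_to_filename

url = 'http://example.com/images/'
parts = url_to_dir_parts(url, include_hostname=True)   # e.g. ['example.com', 'images']
filename = url_to_filename(url)                        # 'index.html' since the path has no filename
print(os.path.join(*(parts + [filename])))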
pipeline Module¶
pipeline.app Module¶
-
class
wpull.pipeline.app.AppSession(factory: wpull.application.factory.Factory, args, stderr)[source]¶ Bases:
object
pipeline.item Module¶
URL items.
-
class
wpull.pipeline.item.LinkType[source]¶ Bases:
enum.EnumThe type of contents that a link is expected to have.
-
css= None¶ Stylesheet file. Recursion on links is usually safe.
-
directory= None¶ FTP directory.
-
file= None¶ FTP File.
-
html= None¶ HTML document.
-
javascript= None¶ JavaScript file. Possible to recurse links on this file.
-
media= None¶ Image or video file. Recursion on this type will not be useful.
-
sitemap= None¶ A Sitemap.xml file.
-
-
class
wpull.pipeline.item.Status[source]¶ Bases:
enum.EnumURL status.
-
done= None¶ The item has been processed successfully.
-
error= None¶ The item encountered an error during processing.
-
in_progress= None¶ The item is in progress of being processed.
-
skipped= None¶ The item was excluded from processing due to some rejection filters.
-
todo= None¶ The item has not yet been processed.
-
-
class
wpull.pipeline.item.URLData[source]¶ Bases:
wpull.pipeline.item.URLDatabaseMixinData associated with fetching the URL.
- post_data (str): If given, the URL should be fetched as a POST request containing post_data.
-
database_attributes= ('post_data',)¶
-
class
wpull.pipeline.item.URLProperties[source]¶ Bases:
wpull.pipeline.item.URLDatabaseMixinURL properties that determine whether a URL is fetched.
-
parent_url¶ str
The parent or referral URL that linked to this URL.
-
root_url¶ str
The earliest ancestor URL of this URL. This URL is typically the URL supplied at the start of the program.
-
status¶ Status
Processing status of this URL.
-
try_count¶ int
The number of attempts on this URL.
-
level¶ int
The recursive depth of this URL. A level of
0 indicates the URL was initially supplied to the program (the top URL). Level 1 means the URL was linked from the top URL.
-
inline_level¶ int
Whether this URL was an embedded object (such as an image or a stylesheet) of the parent URL.
The value represents the recursive depth of the object. For example, an iframe is depth 1 and the images in the iframe are depth 2.
-
link_type¶ LinkType
Describes the expected document type.
-
database_attributes= ('parent_url', 'root_url', 'status', 'try_count', 'level', 'inline_level', 'link_type', 'priority')¶
-
parent_url_info¶ Return URL Info for the parent URL
-
root_url_info¶ Return URL Info for the root URL
-
-
class
wpull.pipeline.item.URLRecord[source]¶ Bases:
wpull.pipeline.item.URLProperties,wpull.pipeline.item.URLData,wpull.pipeline.item.URLResultAn entry in the URL table describing a URL to be downloaded.
-
url¶ str
The URL.
-
url_info¶ Return URL Info for this URL
-
-
class
wpull.pipeline.item.URLResult[source]¶ Bases:
wpull.pipeline.item.URLDatabaseMixinData associated with the fetched URL.
status_code (int): The HTTP or FTP status code. filename (str): The path to where the file was saved.
-
database_attributes= ('status_code', 'filename')¶
-
pipeline.pipeline Module¶
-
class
wpull.pipeline.pipeline.Pipeline(item_source: wpull.pipeline.pipeline.ItemSource, tasks: typing.Sequence, item_queue: typing.Union=None)[source]¶ Bases:
object-
concurrency¶
-
tasks¶
-
-
class
wpull.pipeline.pipeline.PipelineSeries(pipelines: typing.Iterator)[source]¶ Bases:
object-
concurrency¶
-
concurrency_pipelines¶
-
pipelines¶
-
pipeline.progress Module¶
-
class
wpull.pipeline.progress.BarProgress(*args, draw_interval: float=0.5, bar_width: int=25, human_format: bool=True, **kwargs)[source]¶
-
class
wpull.pipeline.progress.Progress(stream: typing.IO=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='ANSI_X3.4-1968'>)[source]¶ Bases:
wpull.application.hook.HookableMixinPrint file download progress as dots or a bar.
Parameters: - bar_style (bool) – If True, print as a progress bar. If False, print dots every few seconds.
- stream – A file object. Default is usually stderr.
- human_format (bool) – If True, format sizes in units. Otherwise, output bits only.
-
class
wpull.pipeline.progress.ProtocolProgress(*args, **kwargs)[source]¶ Bases:
wpull.pipeline.progress.Progress-
ProtocolProgress.update_from_begin_request(request: wpull.protocol.abstract.request.BaseRequest)[source]¶
-
ProtocolProgress.update_from_begin_response(response: wpull.protocol.abstract.request.BaseResponse)[source]¶
-
pipeline.session Module¶
-
class
wpull.pipeline.session.ItemSession(app_session: wpull.pipeline.app.AppSession, url_record: wpull.pipeline.item.URLRecord)[source]¶ Bases:
objectItem for a URL that needs to be processed.
-
add_child_url(url: str, inline: bool=False, link_type: typing.Union=None, post_data: typing.Union=None, level: typing.Union=None, replace: bool=False)[source]¶ Add links scraped from the document with automatic values.
Parameters: - url – A full URL. (It can’t be a relative path.)
- inline – Whether the URL is an embedded object.
- link_type – Expected link type.
- post_data – URL encoded form data. The request will be made using POST. (Don’t use this to upload files.)
- level – The child depth of this URL.
- replace – Whether to replace the existing entry in the database table so it will be redownloaded again.
This function provides values automatically for:
inline, level, parent (the referring page), and root.
See also
add_url().
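A sketch of queuing extra URLs during item processing; receiving the item_session through a plain function is an assumption, since the actual wiring goes through Wpull's plugin hooks, and the URLs are illustrative.
def queue_extra_urls(item_session):
    # Queue a page asset as an embedded object.
    item_session.add_child_url('http://example.com/assets/logo.png', inline=True)
    # Queue a page again so it is re-downloaded even if already in the URL table.
    item_session.add_child_url('http://example.com/news/', replace=True)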
-
child_url_record(url: str, inline: bool=False, link_type: typing.Union=None, post_data: typing.Union=None, level: typing.Union=None)[source]¶ Return a child URLRecord.
This function is useful for testing filters before adding to table.
-
is_processed¶ Return whether the item has been processed.
-
is_virtual¶
-
request¶
-
response¶
-
processor Module¶
Item processing.
processor.base Module¶
Base classes for processors.
-
class
wpull.processor.base.BaseProcessor[source]¶ Bases:
objectBase class for processors.
Processors contain the logic for processing requests.
-
class
wpull.processor.base.BaseProcessorSession[source]¶ Bases:
objectBase class for processor sessions.
-
wpull.processor.base.REMOTE_ERRORS= (<class 'wpull.errors.ServerError'>, <class 'wpull.errors.ProtocolError'>, <class 'wpull.errors.SSLVerificationError'>, <class 'wpull.errors.NetworkError'>)¶ List of error classes that are errors that occur with a server.
processor.coprocessor Module¶
Additional processing not associated with Wget behavior.
processor.coprocessor.phantomjs Module¶
PhantomJS page loading and scrolling.
-
class
wpull.processor.coprocessor.phantomjs.PhantomJSCoprocessor(phantomjs_driver_factory: typing.Callable, processing_rule: wpull.processor.rule.ProcessingRule, phantomjs_params: wpull.processor.coprocessor.phantomjs.PhantomJSParamsType, warc_recorder=None, root_path='.')[source]¶ Bases:
objectPhantomJS coprocessor.
Parameters: - phantomjs_driver_factory – Callback function that accepts a params argument and returns PhantomJSDriver.
- processing_rule – Processing rule.
- warc_recorder – WARC recorder.
- root_dir (str) – Root directory path for temp files.
-
class
wpull.processor.coprocessor.phantomjs.PhantomJSCoprocessorSession(phantomjs_driver_factory, root_path, processing_rule, file_writer_session, request, response, item_session: wpull.pipeline.session.ItemSession, params, warc_recorder)[source]¶ Bases:
objectPhantomJS coprocessor session.
-
exception
wpull.processor.coprocessor.phantomjs.PhantomJSCrashed[source]¶ Bases:
ExceptionPhantomJS exited with non-zero code.
-
wpull.processor.coprocessor.phantomjs.PhantomJSParams¶ PhantomJS parameters
-
wpull.processor.coprocessor.phantomjs.snapshot_type¶ list
File types. Accepted are html, pdf, png, gif.
-
wpull.processor.coprocessor.phantomjs.wait_time¶ float
Time between page scrolls.
-
wpull.processor.coprocessor.phantomjs.num_scrolls¶ int
Maximum number of scrolls.
-
wpull.processor.coprocessor.phantomjs.smart_scroll¶ bool
Whether to stop scrolling if number of requests & responses do not change.
-
wpull.processor.coprocessor.phantomjs.snapshot¶ bool
Whether to take snapshot files.
-
wpull.processor.coprocessor.phantomjs.viewport_size¶ tuple
Width and height of the page viewport.
-
wpull.processor.coprocessor.phantomjs.paper_size¶ tuple
Width and height of the paper size.
-
wpull.processor.coprocessor.phantomjs.load_time¶ float
Maximum time to wait for page load.
-
wpull.processor.coprocessor.phantomjs.custom_headers¶ dict
Default HTTP headers.
-
wpull.processor.coprocessor.phantomjs.page_settings¶ dict
Page settings.
alias of
PhantomJSParamsType-
processor.coprocessor.proxy Module¶
-
class
wpull.processor.coprocessor.proxy.ProxyCoprocessor(app_session: wpull.pipeline.app.AppSession)[source]¶ Bases:
objectProxy coprocessor.
processor.coprocessor.youtubedl Module¶
-
class
wpull.processor.coprocessor.youtubedl.Session(proxy_address, youtube_dl_path, root_path, item_session: wpull.pipeline.session.ItemSession, file_writer_session, user_agent, warc_recorder, inet_family, check_certificate)[source]¶ Bases:
objectyoutube-dl session.
processor.delegate Module¶
Delegation to another processor.
processor.ftp Module¶
FTP
-
class
wpull.processor.ftp.FTPProcessor(ftp_client: wpull.protocol.ftp.client.Client, fetch_params)[source]¶ Bases:
wpull.processor.base.BaseProcessorFTP processor.
Parameters: - ftp_client – The FTP client.
- fetch_params (
FTPProcessorFetchParams) – Parameters for fetching.
-
fetch_params¶ The fetch parameters.
-
ftp_client¶ The ftp client.
-
listing_cache¶ Listing cache.
Returns: A cache mapping from URL to list of ftp.ls.listing.FileEntry.
-
wpull.processor.ftp.FTPProcessorFetchParams¶ FTPProcessorFetchParams
Parameters: - remove_listing (bool) – Remove .listing files after fetching.
- glob (bool) – Enable URL globbing.
- preserve_permissions (bool) – Preserve file permissions.
- follow_symlinks (bool) – Follow symlinks.
alias of
FTPProcessorFetchParamsType
-
class
wpull.processor.ftp.FTPProcessorSession(processor: wpull.processor.ftp.FTPProcessor, item_session: wpull.pipeline.session.ItemSession)[source]¶ Bases:
wpull.processor.base.BaseProcessorSessionFetches FTP files or directory listings.
-
exception
wpull.processor.ftp.HookPreResponseBreak[source]¶ Bases:
wpull.errors.ProtocolErrorHook pre-response break.
processor.rule Module¶
Fetching rules.
-
class
wpull.processor.rule.FetchRule(url_filter: wpull.urlfilter.DemuxURLFilter=None, robots_txt_checker: wpull.protocol.http.robots.RobotsTxtChecker=None, http_login: typing.Union=None, ftp_login: typing.Union=None, duration_timeout: typing.Union=None)[source]¶ Bases:
wpull.application.hook.HookableMixinDecide on what URLs should be fetched.
-
check_ftp_request(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple¶ Check URL filters and scripting hook.
Returns: (bool, str) Return type: tuple
-
check_generic_request(item_session: wpull.pipeline.session.ItemSession) → typing.Tuple[source]¶ Check URL filters and scripting hook.
Returns: (bool, str) Return type: tuple
-
check_initial_web_request(item_session: wpull.pipeline.session.ItemSession, request: wpull.protocol.http.request.Request) → typing.Tuple[source]¶ Check robots.txt, URL filters, and scripting hook.
Returns: (bool, str) Return type: tuple Coroutine.
-
check_subsequent_web_request(item_session: wpull.pipeline.session.ItemSession, is_redirect: bool=False) → typing.Tuple[source]¶ Check URL filters and scripting hook.
Returns: (bool, str) Return type: tuple
-
consult_filters(url_info: wpull.url.URLInfo, url_record: wpull.pipeline.item.URLRecord, is_redirect: bool=False) → typing.Tuple[source]¶ Consult the URL filter.
Parameters: - url_record – The URL record.
- is_redirect – Whether the request is a redirect and it is desired that it spans hosts.
Returns: tuple:
- bool: The verdict
- str: A short reason string: nofilters, filters, redirect
- dict: The result from
DemuxURLFilter.test_info()
-
consult_hook(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reason: str, test_info: dict)[source]¶ Consult the scripting hook.
Returns: (bool, str) Return type: tuple
-
consult_robots_txt(request: wpull.protocol.http.request.Request) → bool[source]¶ Consult by fetching robots.txt as needed.
Parameters: request – The request to be made to get the file. Returns: True if can fetch Coroutine
-
classmethod
is_only_span_hosts_failed(test_info: dict) → bool[source]¶ Return whether only the SpanHostsFilter failed.
-
static
plugin_accept_url(item_session: wpull.pipeline.session.ItemSession, verdict: bool, reasons: dict) → bool[source]¶ Return whether to download this URL.
Parameters: - item_session – Current URL item.
- verdict – A bool indicating whether Wpull wants to download the URL.
- reasons –
A dict containing information for the verdict:
filters(dict): A mapping (str to bool) from filter name to whether the filter passed or not.reason(str): A short reason string. Current values are:filters,robots,redirect.
Returns: If
True, the URL should be downloaded. Otherwise, the URL is skipped.
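The function below sketches a veto callback with the documented signature; how it gets registered with the plugin system is not shown, and the URL patterns are illustrative.
def accept_url(item_session, verdict, reasons):
    url = item_session.url_record.url
    if reasons.get('reason') == 'robots' and '/archive/' in url:
        return True    # override a robots.txt rejection for archive paths
    if url.endswith('.iso'):
        return False   # always skip large disc images
    return verdict     # otherwise keep Wpull's own decision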
-
-
class
wpull.processor.rule.ProcessingRule(fetch_rule: wpull.processor.rule.FetchRule, document_scraper: wpull.scraper.base.DemuxDocumentScraper=None, sitemaps: bool=False, url_rewriter: wpull.urlrewrite.URLRewriter=None)[source]¶ Bases:
wpull.application.hook.HookableMixinDocument processing rules.
Parameters: - fetch_rule – The FetchRule instance.
- document_scraper – The document scraper.
-
add_extra_urls(item_session: wpull.pipeline.session.ItemSession)[source]¶ Add additional URLs such as robots.txt, favicon.ico.
-
static
parse_url(url, encoding='utf-8')¶ Parse and return a URLInfo.
This function logs a warning if the URL cannot be parsed and returns None.
-
static
plugin_get_urls(item_session: wpull.pipeline.session.ItemSession)[source]¶ Add additional URLs to be added to the URL Table.
When this event is dispatched, the caller should add any URLs needed using
ItemSession.add_child_url().
-
class
wpull.processor.rule.ResultRule(ssl_verification: bool=False, retry_connrefused: bool=False, retry_dns_error: bool=False, waiter: typing.Union=None, statistics: typing.Union=None)[source]¶ Bases:
wpull.application.hook.HookableMixinDecide on the results of a fetch.
Parameters: - ssl_verification – If True, don’t ignore certificate errors.
- retry_connrefused – If True, don’t consider a connection refused error to be a permanent error.
- retry_dns_error – If True, don’t consider a DNS resolution error to be permanent error.
- waiter – The Waiter.
- statistics – The Statistics.
-
consult_error_hook(item_session: wpull.pipeline.session.ItemSession, error: BaseException)[source]¶ Return scripting action when an error occurred.
-
consult_pre_response_hook(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Return scripting action when a response begins.
-
consult_response_hook(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Return scripting action when a response ends.
-
get_wait_time(item_session: wpull.pipeline.session.ItemSession, error=None)[source]¶ Return the wait time in seconds between requests.
-
handle_document(item_session: wpull.pipeline.session.ItemSession, filename: str) → wpull.application.hook.Actions[source]¶ Process a successful document response.
Returns: A value from hook.Actions.
-
handle_document_error(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Callback for when the document only describes a server error.
Returns: A value from hook.Actions.
-
handle_error(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions[source]¶ Process an error.
Returns: A value from hook.Actions.
-
handle_intermediate_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Callback for successful intermediate responses.
Returns: A value from hook.Actions.
-
handle_no_document(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Callback for successful responses containing no useful document.
Returns: A value from hook.Actions.
-
handle_pre_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Process a response that is starting.
-
handle_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Generic handler for a response.
Returns: A value from hook.Actions.
-
static
plugin_handle_error(item_session: wpull.pipeline.session.ItemSession, error: BaseException) → wpull.application.hook.Actions[source]¶ Return an action to handle the error.
Parameters: - item_session –
- error –
Returns: A value from
Actions. The default isActions.NORMAL.
-
static
plugin_handle_pre_response(item_session: wpull.pipeline.session.ItemSession) → wpull.application.hook.Actions[source]¶ Return an action to handle a response status before a download.
Parameters: item_session – Returns: A value from Actions. The default isActions.NORMAL.
processor.web Module¶
Web processing.
-
exception
wpull.processor.web.HookPreResponseBreak[source]¶ Bases:
wpull.errors.ProtocolErrorHook pre-response break.
-
class
wpull.processor.web.WebProcessor(web_client: wpull.protocol.http.web.WebClient, fetch_params: wpull.processor.web.WebProcessorFetchParamsType)[source]¶ Bases:
wpull.processor.base.BaseProcessor,wpull.application.hook.HookableMixinHTTP processor.
Parameters: - web_client – The web client.
- fetch_params – Fetch parameters
See also
-
DOCUMENT_STATUS_CODES= (200, 204, 206, 304)¶ Default status codes considered successfully fetching a document.
-
NO_DOCUMENT_STATUS_CODES= (401, 403, 404, 405, 410)¶ Default status codes considered a permanent error.
-
fetch_params¶ The fetch parameters.
-
web_client¶ The web client.
-
wpull.processor.web.WebProcessorFetchParams¶ WebProcessorFetchParams
Parameters: - post_data (str) – If provided, all requests will be POSTed with the given post_data. post_data must be in percent-encoded query format (“application/x-www-form-urlencoded”).
- strong_redirects (bool) – If True, redirects are allowed to span hosts.
alias of
WebProcessorFetchParamsType
-
class
wpull.processor.web.WebProcessorSession(processor: wpull.processor.web.WebProcessor, item_session: wpull.pipeline.session.ItemSession)[source]¶ Bases:
wpull.processor.base.BaseProcessorSessionFetches an HTTP document.
This Processor Session will handle document redirects within the same Session. HTTP errors such as 404 are considered permanent errors. HTTP errors like 500 are considered transient errors and are handled in subsequent sessions by marking the item as “error”.
If a successful document has been downloaded, it will be scraped for URLs to be added to the URL table. This Processor Session is very simple; it cannot handle JavaScript or Flash plugins.
protocol Module¶
protocol.abstract Module¶
Conversation abstractions.
protocol.abstract.client Module¶
Client abstractions
-
class
wpull.protocol.abstract.client.BaseClient(connection_pool: typing.Union=None)[source]¶ Bases:
typing.Generic,wpull.application.hook.HookableMixinBase client.
-
class
wpull.protocol.abstract.client.BaseSession(connection_pool)[source]¶ Bases:
wpull.application.hook.HookableMixinBase session.
-
exception
wpull.protocol.abstract.client.DurationTimeout[source]¶ Bases:
wpull.errors.NetworkTimedOutDownload did not complete within specified time.
protocol.abstract.request Module¶
Request object abstractions
-
class
wpull.protocol.abstract.request.BaseResponse[source]¶ Bases:
wpull.protocol.abstract.request.ProtocolResponseMixin
-
class
wpull.protocol.abstract.request.ProtocolResponseMixin[source]¶ Bases:
objectProtocol abstraction for response objects.
-
protocol¶ Name of the protocol.
Returns: Either ftp or http. Return type: str
-
protocol.abstract.stream Module¶
Abstract stream classes
protocol.ftp Module¶
File transfer protocol.
protocol.ftp.client Module¶
FTP client.
-
class
wpull.protocol.ftp.client.Client(*args, **kwargs)[source]¶ Bases:
wpull.protocol.abstract.client.BaseClientFTP Client.
The session object is
Session.
-
class
wpull.protocol.ftp.client.Session(login_table: weakref.WeakKeyDictionary, **kwargs)[source]¶ Bases:
wpull.protocol.abstract.client.BaseSession-
Session.download(file: typing.Union=None, rewind: bool=True, duration_timeout: typing.Union=None) → wpull.protocol.ftp.request.Response[source]¶ Read the response content into file.
Parameters: - file – A file object or asyncio stream.
- rewind – Seek the given file back to its original offset after reading is finished.
- duration_timeout – Maximum time in seconds of which the entire file must be read.
Returns: A Response populated with the final data connection reply.
Be sure to call
start() first. Coroutine.
-
Session.download_listing(file: typing.Union, duration_timeout: typing.Union=None) → wpull.protocol.ftp.request.ListingResponse[source]¶ Read file listings.
Parameters: - file – A file object or asyncio stream.
- duration_timeout – Maximum time in seconds of which the entire file must be read.
Returns: A Response populated the file listings
Be sure to call
start_file_listing() first. Coroutine.
-
Session.start(request: wpull.protocol.ftp.request.Request) → wpull.protocol.ftp.request.Response[source]¶ Start a file or directory listing download.
Parameters: request – Request. Returns: A Response populated with the initial data connection reply. Once the response is received, call
download(). Coroutine.
-
Session.start_listing(request: wpull.protocol.ftp.request.Request) → wpull.protocol.ftp.request.ListingResponse[source]¶ Fetch a file listing.
Parameters: request – Request. Returns: A listing response populated with the initial data connection reply. Once the response is received, call
download_listing(). Coroutine.
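A minimal sketch of fetching one file with the FTP client; the async/await driving of the documented coroutines and the client.session() context manager are assumptions about the surrounding plumbing.
import asyncio
from wpull.protocol.ftp.client import Client
from wpull.protocol.ftp.request import Request

async def fetch_readme():
    client = Client()
    with client.session() as session:
        await session.start(Request('ftp://example.com/pub/README'))
        with open('README', 'wb') as file:
            await session.download(file=file)

asyncio.get_event_loop().run_until_complete(fetch_readme())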
-
protocol.ftp.command Module¶
FTP service control.
-
class
wpull.protocol.ftp.command.Commander(data_stream)[source]¶ Bases:
objectHelper class that performs typical FTP routines.
Parameters: control_stream ( ftp.stream.ControlStream) – The control stream.-
begin_stream(command: wpull.protocol.ftp.request.Command) → wpull.protocol.ftp.request.Reply[source]¶ Start sending content on the data stream.
Parameters: command – A command that tells the server to send data over the data connection.
Coroutine.
Returns: The begin reply.
-
passive_mode() → typing.Tuple[source]¶ Enable passive mode.
Returns: The address (IP address, port) of the passive port. Coroutine.
-
classmethod
raise_if_not_match(action: str, expected_code: typing.Union, reply: wpull.protocol.ftp.request.Reply)[source]¶ Raise FTPServerError if not expected reply code.
Parameters: - action – Label to use in the exception message.
- expected_code – Expected 3 digit code.
- reply – Reply from the server.
-
read_stream(file: typing.IO, data_stream: wpull.protocol.ftp.stream.DataStream) → wpull.protocol.ftp.request.Reply[source]¶ Read from the data stream.
Parameters: - file – A destination file object or a stream writer.
- data_stream – The stream of which to read from.
Coroutine.
Returns: The final reply. Return type: Reply
-
setup_data_stream(connection_factory: typing.Callable, data_stream_factory: typing.Callable=<class 'wpull.protocol.ftp.stream.DataStream'>) → wpull.protocol.ftp.stream.DataStream[source]¶ Create and setup a data stream.
This function will set up passive and binary mode and handle connecting to the data connection.
Parameters: - connection_factory – A coroutine callback that returns a connection
- data_stream_factory – A callback that returns a data stream
Coroutine.
Returns: DataStream
-
protocol.ftp.ls Module¶
I-tried-my-best LIST parsing package.
protocol.ftp.ls.date Module¶
Date and time parsing
-
wpull.protocol.ftp.ls.date.AM_STRINGS= {'vorm', 'पूर्व', 'a. m', 'am', '午前', '上午', 'ص'}¶ Set of AM day period strings.
-
wpull.protocol.ftp.ls.date.DAY_PERIOD_PATTERN= re.compile('(nachm|vorm|م|पूर्व|a. m|午後|अपर|pm|下午|am|p. m|午前|上午|ص)\\b', re.IGNORECASE)¶ Regex pattern for AM/PM string.
-
wpull.protocol.ftp.ls.date.ISO_8601_DATE_PATTERN= re.compile('(\\d{4})(?!\\d)[\\w./-](\\d{1,2})(?!\\d)[\\w./-](\\d{1,2})')¶ Regex pattern for dates similar to YYYY-MM-DD.
-
wpull.protocol.ftp.ls.date.MMM_DD_YY_PATTERN= re.compile('([^\\W\\d_]{3,4})\\s{0,4}(\\d{1,2})\\s{0,4}(\\d{0,4})')¶ Regex pattern for dates similar to MMM DD YY.
Example: Feb 09 90
-
wpull.protocol.ftp.ls.date.MONTH_MAP= {'أبريل': 4, 'juni': 6, 'set': 9, 'lis': 11, 'juil': 7, 'lip': 7, 'aug': 8, 'sie': 8, 'जन': 1, 'يوليو': 7, 'नवं': 11, 'lut': 2, 'oct': 10, '7月': 7, 'juin': 6, 'فبراير': 2, '3月': 3, 'dec': 12, 'मार्च': 3, 'अक्टू': 10, 'sty': 1, 'जुला': 7, 'juli': 7, 'أكتوبر': 10, 'марта': 3, 'jan': 1, 'янв': 1, 'нояб': 11, 'ديسمبر': 12, 'apr': 4, 'अग': 8, 'août': 8, 'ago': 8, 'июня': 6, 'окт': 10, 'févr': 2, 'मई': 5, '8月': 8, 'ene': 1, 'сент': 9, 'نوفمبر': 11, '9月': 9, 'nov': 11, '5月': 5, '10月': 10, 'jul': 7, 'يناير': 1, 'जून': 6, 'mars': 3, 'déc': 12, 'dez': 12, 'dic': 12, 'okt': 10, 'апр': 4, 'avr': 4, 'mai': 5, 'gru': 12, '6月': 6, 'июля': 7, '12月': 12, 'wrz': 9, 'out': 10, 'авг': 8, 'फ़र': 2, 'мая': 5, 'февр': 2, 'سبتمبر': 9, 'feb': 2, 'अप्रै': 4, 'maj': 5, 'fev': 2, 'مارس': 3, '1月': 1, 'may': 5, 'mar': 3, '4月': 4, 'jun': 6, 'दिसं': 12, 'paź': 10, 'sep': 9, 'kwi': 4, '11月': 11, '2月': 2, 'abr': 4, 'सितं': 9, 'märz': 3, 'مايو': 5, 'أغسطس': 8, 'sept': 9, 'janv': 1, 'дек': 12, 'cze': 6, 'يونيو': 6}¶ Month names to int.
-
wpull.protocol.ftp.ls.date.NN_NN_NNNN_PATTERN= re.compile('(\\d{1,2})[./-](\\d{1,2})[./-](\\d{2,4})')¶ Regex pattern for dates similar to NN NN YYYY.
Example: 2/9/90
-
wpull.protocol.ftp.ls.date.PM_STRINGS= {'nachm', 'م', '午後', 'अपर', 'pm', '下午', 'p. m'}¶ Set of PM day period strings.
-
wpull.protocol.ftp.ls.date.TIME_PATTERN= re.compile('(\\d{1,2}):(\\d{2}):?(\\d{0,2})\\s?(nachm|vorm|م|पूर्व|a. m|午後|अपर|pm|下午|am|p. m|午前|上午|ص|\x08)?')¶ Regex pattern for time in HH:MM[:SS]
-
wpull.protocol.ftp.ls.date.guess_datetime_format(lines: typing.Iterable, threshold: int=5) → typing.Tuple[source]¶ Guess the order of the year, month, and day, and whether 12- or 24-hour time is used.
Returns: First item is either str ymd, dmy, mdy or None. Second item is either True for 12-hour time or False for 24-hour time or None. Return type: tuple
-
wpull.protocol.ftp.ls.date.parse_cldr_json(directory, language_codes=('zh', 'es', 'en', 'hi', 'ar', 'pt', 'ru', 'ja', 'de', 'fr', 'pl'), massage=True)[source]¶ Parse CLDR JSON datasets for date and time strings.
protocol.ftp.ls.listing Module¶
Listing parser.
-
wpull.protocol.ftp.ls.listing.FileEntry¶ A row in a listing.
-
wpull.protocol.ftp.ls.listing.name¶ str
Filename.
-
wpull.protocol.ftp.ls.listing.type¶ str, None
file,dir,symlink,other,None
-
wpull.protocol.ftp.ls.listing.size¶ int, None
Size of file.
-
wpull.protocol.ftp.ls.listing.date¶ datetime.datetime, NoneA datetime object in UTC.
-
wpull.protocol.ftp.ls.listing.dest¶ str, None
Destination filename for symlinks.
-
wpull.protocol.ftp.ls.listing.perm¶ int, None
Unix permissions expressed as an integer.
alias of
FileEntryType-
-
class
wpull.protocol.ftp.ls.listing.LineParser[source]¶ Bases:
objectParse individual lines in a listing.
-
exception
wpull.protocol.ftp.ls.listing.ListingError[source]¶ Bases:
ValueErrorError during parsing a listing.
-
class
wpull.protocol.ftp.ls.listing.ListingParser(text=None, file=None)[source]¶ Bases:
wpull.protocol.ftp.ls.listing.LineParserListing parser.
Parameters: - text (str) – A text listing.
- file – A file object in text mode containing the listing.
-
exception
wpull.protocol.ftp.ls.listing.UnknownListingError[source]¶ Bases:
wpull.protocol.ftp.ls.listing.ListingErrorFailed to determine type of listing.
-
wpull.protocol.ftp.ls.listing.guess_listing_type(lines, threshold=100)[source]¶ Guess the style of directory listing.
Returns: unix,msdos,nlst,unknown.Return type: str
protocol.ftp.request Module¶
FTP conversation classes
-
class
wpull.protocol.ftp.request.Command(name=None, argument='')[source]¶ Bases:
wpull.protocol.abstract.request.SerializableMixin,wpull.protocol.abstract.request.DictableMixinFTP request command.
Encoding is UTF-8.
-
name¶ str
The command. Usually 4 characters or less.
-
argument¶ str
Optional argument for the command.
-
name
-
-
class
wpull.protocol.ftp.request.ListingResponse[source]¶ Bases:
wpull.protocol.ftp.request.ResponseFTP response for a file listing.
-
files¶ list
A list of
ftp.ls.listing.FileEntry
-
-
class
wpull.protocol.ftp.request.Reply(code=None, text=None)[source]¶ Bases:
wpull.protocol.abstract.request.SerializableMixin,wpull.protocol.abstract.request.DictableMixinFTP reply.
Encoding is always UTF-8.
-
code¶ int
Reply code.
-
text¶ str
Reply message.
-
-
class
wpull.protocol.ftp.request.Request(url)[source]¶ Bases:
wpull.protocol.abstract.request.BaseRequest,wpull.protocol.abstract.request.URLPropertyMixinFTP request for a file.
-
address¶ tuple
Address of control connection.
-
data_address¶ tuple
Address of data connection.
-
username¶ str, None
Username for login.
-
password¶ str, None
Password for login.
-
restart_value¶ int, None
Optional value for
RESTcommand.
-
file_path¶ str
Path of the file.
-
file_path
-
-
class
wpull.protocol.ftp.request.Response[source]¶ Bases:
wpull.protocol.abstract.request.BaseResponse,wpull.protocol.abstract.request.DictableMixinFTP response for a file.
-
file_transfer_size¶ int
Size of the file transfer without considering restart. (REST is issued last.)
This will be the file size. (STREAM mode is always used.)
-
restart_value¶ int
Offset value of restarted transfer.
-
protocol¶
-
protocol.ftp.stream Module¶
FTP Streams
-
class
wpull.protocol.ftp.stream.ControlStream(connection: wpull.network.connection.Connection)[source]¶ Bases:
objectStream class for a control connection.
Parameters: connection – Connection. -
data_event_dispatcher¶
-
read_reply() → wpull.protocol.ftp.request.Reply[source]¶ Read a reply from the stream.
Returns: The reply Return type: ftp.request.Reply Coroutine.
-
protocol.ftp.util Module¶
Utils
-
exception
wpull.protocol.ftp.util.FTPServerError[source]¶ Bases:
wpull.errors.ServerError-
reply_code¶ Return reply code.
-
-
class
wpull.protocol.ftp.util.ReplyCodes[source]¶ Bases:
object-
bad_sequence_of_commands= 503¶
-
cant_open_data_connection= 425¶
-
closing_data_connection= 226¶
-
command_not_implemented= 502¶
-
command_not_implemented_for_that_parameter= 504¶
-
command_not_implemented_superfluous_at_this_site= 202¶
-
command_okay= 200¶
-
connection_closed_transfer_aborted= 426¶
-
data_connection_already_open_transfer_starting= 125¶
-
data_connection_open_no_transfer_in_progress= 225¶
-
directory_status= 212¶
-
entering_passive_mode= 227¶
-
file_status= 213¶
-
file_status_okay_about_to_open_data_connection= 150¶
-
help_message= 214¶
-
name_system_type= 215¶
-
need_account_for_login= 332¶
-
need_account_for_storing_files= 532¶
-
not_logged_in= 530¶
-
pathname_created= 257¶
-
requested_action_aborted_local_error_in_processing= 451¶
-
requested_action_aborted_page_type_unknown= 551¶
-
requested_action_not_taken_file_name_not_allowed= 553¶
-
requested_action_not_taken_insufficient_storage_space= 452¶
-
requested_file_action_aborted= 552¶
-
requested_file_action_not_taken= 450¶
-
requested_file_action_okay_completed= 250¶
-
requested_file_action_pending_further_information= 350¶
-
restart_marker_reply= 110¶
-
service_closing_control_connection= 221¶
-
service_not_available_closing_control_connection= 421¶
-
service_ready_for_new_user= 220¶
-
service_ready_in_nnn_minutes= 120¶
-
syntax_error_command_unrecognized= 500¶
-
syntax_error_in_parameters_or_arguments= 501¶
-
system_status_or_system_help_reply= 211¶
-
user_logged_in_proceed= 230¶
-
user_name_okay_need_password= 331¶
-
-
wpull.protocol.ftp.util.convert_machine_list_time_val(text: str) → datetime.datetime[source]¶ Convert RFC 3659 time-val to datetime objects.
-
wpull.protocol.ftp.util.convert_machine_list_value(name: str, value: str) → typing.Union[source]¶ Convert sizes and time values.
Size will be
intwhile time value will bedatetime.datetime.
-
wpull.protocol.ftp.util.machine_listings_to_file_entries(listings: typing.Iterable) → typing.Iterable[source]¶ Convert results from parsing machine listings to FileEntry list.
-
wpull.protocol.ftp.util.parse_machine_listing(text: str, convert: bool=True, strict: bool=True) → typing.List[source]¶ Parse machine listing.
Parameters: - text – The listing.
- convert – Convert sizes and dates.
- strict – Method of handling errors.
True will raise ValueError. False will ignore rows with errors.
Returns: A list of dict of the facts defined in RFC 3659. The key names must be lowercase. The filename uses the key
name. Return type: list
protocol.http Module¶
HTTP Protocol.
protocol.http.chunked Module¶
Chunked transfer encoding.
-
class
wpull.protocol.http.chunked.ChunkedTransferReader(connection, read_size=4096)[source]¶ Bases:
objectRead chunked transfer encoded stream.
Parameters: connection ( connection.Connection) – Established connection.-
read_chunk_body()[source]¶ Read a fragment of a single chunk.
Call
read_chunk_header() first. Returns: 2-item tuple with the content data and raw data. The first item is an empty bytes string when the chunk is fully read. Return type: tuple Coroutine.
-
protocol.http.client Module¶
Basic HTTP Client.
-
class
wpull.protocol.http.client.Client(*args, stream_factory=<class 'wpull.protocol.http.stream.Stream'>, **kwargs)[source]¶ Bases:
wpull.protocol.abstract.client.BaseClientStateless HTTP/1.1 client.
The session object is
Session.
-
class
wpull.protocol.http.client.Session(stream_factory: typing.Callable=None, **kwargs)[source]¶ Bases:
wpull.protocol.abstract.client.BaseSessionHTTP request and response session.
-
Session.done() → bool[source]¶ Return whether the session was complete.
A session is complete when it has sent a request, read the response header and the response body.
-
Session.download(file: typing.Union=None, raw: bool=False, rewind: bool=True, duration_timeout: typing.Union=None)[source]¶ Read the response content into file.
Parameters: - file – A file object or asyncio stream.
- raw – Whether chunked transfer encoding should be included.
- rewind – Seek the given file back to its original offset after reading is finished.
- duration_timeout – Maximum time in seconds of which the entire file must be read.
Be sure to call
start() first. Coroutine.
-
Session.start(request: wpull.protocol.http.request.Request) → wpull.protocol.http.request.Response[source]¶ Begin a HTTP request
Parameters: request – Request information. Returns: A response populated with the HTTP headers. Once the headers are received, call
download(). Coroutine.
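A minimal sketch of a single fetch with this low-level client; the async/await syntax and the client.session() context manager are assumptions about how the documented coroutines are driven.
import asyncio
from wpull.protocol.http.client import Client
from wpull.protocol.http.request import Request

async def fetch_page():
    client = Client()
    with client.session() as session:
        response = await session.start(Request('http://example.com/'))
        print(response.status_code)
        with open('example.html', 'wb') as file:
            await session.download(file=file)

asyncio.get_event_loop().run_until_complete(fetch_page())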
-
protocol.http.redirect Module¶
Redirection tracking.
-
class
wpull.protocol.http.redirect.RedirectTracker(max_redirects=20, codes=(301, 302, 303), repeat_codes=(307, 308))[source]¶ Bases:
objectKeeps track of HTTP document URL redirects.
Parameters: - max_redirects (int) – The maximum number of redirects to allow.
- codes – The HTTP status codes indicating a redirect where the method can change to “GET”.
- repeat_codes – The HTTP status codes indicating a redirect where the method cannot change and future requests should be repeated.
-
REDIRECT_CODES= (301, 302, 303)¶
-
REPEAT_REDIRECT_CODES= (307, 308)¶
-
load(response)[source]¶ Load the response and increment the counter.
Parameters: response ( http.request.Response) – The response from a previous request.
-
next_location(raw=False)[source]¶ Returns the next location.
Parameters: raw (bool) – If True, the original string contained in the Location field will be returned. Otherwise, the URL will be normalized to a complete URL. Returns: If str, the location. Otherwise, no next location. Return type: str, None
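A short sketch of consuming one redirect with the documented load() and next_location() calls; the response variable is assumed to come from a completed fetch that returned a 3xx status.
from wpull.protocol.http.redirect import RedirectTracker

tracker = RedirectTracker(max_redirects=5)
tracker.load(response)                # count the redirect and note its Location field
location = tracker.next_location()    # normalized URL of the next request, or None
if location:
    print('Next request goes to', location)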
protocol.http.request Module¶
HTTP conversation objects.
-
class
wpull.protocol.http.request.RawRequest(method=None, resource_path=None, version='HTTP/1.1')[source]¶ Bases:
wpull.protocol.abstract.request.BaseRequest,wpull.protocol.abstract.request.SerializableMixin,wpull.protocol.abstract.request.DictableMixinRepresents an HTTP request.
-
method¶ str
The HTTP method in the status line. For example,
GET,POST.
-
resource_path¶ str
The URL or “path” in the status line.
-
version¶ str
The HTTP version in the status line. For example,
HTTP/1.0.
-
fields¶ -
The fields in the HTTP header.
-
encoding¶ str
The encoding of the status line.
-
-
class
wpull.protocol.http.request.Request(url=None, method='GET', version='HTTP/1.1')[source]¶ Bases:
wpull.protocol.http.request.RawRequestRepresents a higher level of HTTP request.
-
address¶ tuple
An address tuple suitable for
socket.connect().
-
username¶ str
Username for HTTP authentication.
-
password¶ str
Password for HTTP authentication.
-
-
class
wpull.protocol.http.request.Response(status_code=None, reason=None, version='HTTP/1.1', request=None)[source]¶ Bases:
wpull.protocol.abstract.request.BaseResponse,wpull.protocol.abstract.request.SerializableMixin,wpull.protocol.abstract.request.DictableMixinRepresents the HTTP response.
-
status_code¶ int
The status code in the status line.
-
status_reason¶ str
The status reason string in the status line.
-
version¶ str
The HTTP version in the status line. For example,
HTTP/1.1.
-
fields¶ -
The fields in the HTTP headers (and trailer, if present).
-
request¶ The corresponding request.
-
encoding¶ str
The encoding of the status line.
-
classmethod
parse_status_line(data)[source]¶ Parse the status line bytes.
Returns: A tuple representing the version, code, and reason. Return type: tuple
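For example, parsing a raw status line (a sketch based on the documented return value):
from wpull.protocol.http.request import Response

version, code, reason = Response.parse_status_line(b'HTTP/1.1 200 OK')
print(version, code, reason)   # HTTP/1.1 200 OK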
-
protocol¶
-
protocol.http.robots Module¶
Robots.txt file logistics.
-
exception
wpull.protocol.http.robots.NotInPoolError[source]¶ Bases:
ExceptionThe URL is not in the pool.
-
class
wpull.protocol.http.robots.RobotsTxtChecker(web_client: wpull.protocol.http.web.WebClient=None, robots_txt_pool: wpull.robotstxt.RobotsTxtPool=None)[source]¶ Bases:
objectRobots.txt file fetcher and checker.
Parameters: - web_client – Web Client.
- robots_txt_pool – Robots.txt Pool.
-
can_fetch(request: wpull.protocol.http.request.Request, file=None) → bool[source]¶ Return whether the request can be fetched.
Parameters: - request – Request.
- file – A file object to where the robots.txt contents are written.
Coroutine.
-
can_fetch_pool(request: wpull.protocol.http.request.Request)[source]¶ Return whether the request can be fetched based on the pool.
-
fetch_robots_txt(request: wpull.protocol.http.request.Request, file=None)[source]¶ Fetch the robots.txt file for the request.
Coroutine.
-
robots_txt_pool¶ Return the RobotsTxtPool.
-
web_client¶ Return the WebClient.
protocol.http.stream Module¶
HTML protocol streamers.
-
wpull.protocol.http.stream.DEFAULT_NO_CONTENT_CODES= frozenset({100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 204, 304})¶ Status codes where a response body is prohibited.
-
class
wpull.protocol.http.stream.Stream(connection, keep_alive=True, ignore_length=False)[source]¶ Bases:
objectHTTP stream reader/writer.
Parameters: - connection (
connection.Connection) – An established connection. - keep_alive (bool) – If True, use HTTP keep-alive.
- ignore_length (bool) – If True, Content-Length headers will be ignored. When using this option, keep_alive should be False.
-
connection¶ The underlying connection.
-
connection
-
data_event_dispatcher¶
-
classmethod
get_read_strategy(response)[source]¶ Return the appropriate algorithm of reading response.
Returns: chunked,length,close.Return type: str
-
read_body(request, response, file=None, raw=False)[source]¶ Read the response’s content body.
Coroutine.
- connection (
-
wpull.protocol.http.stream.is_no_body(request, response, no_content_codes=frozenset({100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168, 169, 170, 171, 172, 173, 174, 175, 176, 177, 178, 179, 180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 204, 304}))[source]¶ Return whether a content body is not expected.
protocol.http.util Module¶
Miscellaneous HTTP functions.
protocol.http.web Module¶
Advanced HTTP Client handling.
-
class
wpull.protocol.http.web.LoopType[source]¶ Bases:
enum.EnumIndicates the type of request and response.
-
authentication= None¶ Response to a HTTP authentication.
-
normal= None¶ Normal response.
-
redirect= None¶ Redirect.
-
robots= None¶ Response to a robots.txt request.
-
-
class
wpull.protocol.http.web.WebClient(http_client: typing.Union=None, request_factory: typing.Callable=<class 'wpull.protocol.http.request.Request'>, redirect_tracker_factory: typing.Union=<class 'wpull.protocol.http.redirect.RedirectTracker'>, cookie_jar: typing.Union=None)[source]¶ Bases:
objectA web client handles redirects, cookies, basic authentication.
Parameters: - http_client – An HTTP client.
- request_factory – A function that returns a new
http.request.Request - redirect_tracker_factory – A function that returns a new
http.redirect.RedirectTracker - cookie_jar – A cookie jar.
cookie_jar¶ Return the Cookie Jar.
-
http_client¶ Return the HTTP Client.
-
redirect_tracker_factory¶ Return the Redirect Tracker factory.
-
request_factory¶ Return the Request factory.
-
session(request: wpull.protocol.http.request.Request) → wpull.protocol.http.web.WebSession[source]¶ Return a fetch session.
Parameters: request – The request to be fetched. Example usage:
client = WebClient()
session = client.session(Request('http://www.example.com'))

with session:
    while not session.done():
        request = session.next_request()
        print(request)

        response = yield from session.start()
        print(response)

        if session.done():
            with open('myfile.html') as file:
                yield from session.download(file)
        else:
            yield from session.download()
Returns: WebSession
-
class
wpull.protocol.http.web.WebSession(request: wpull.protocol.http.request.Request, http_client: wpull.protocol.http.client.Client, redirect_tracker: wpull.protocol.http.redirect.RedirectTracker, request_factory: typing.Callable, cookie_jar: typing.Union=None)[source]¶ Bases:
objectA web session.
-
done() → bool[source]¶ Return whether the session has finished.
Returns: If True, the document has been fully fetched. Return type: bool
-
download(file: typing.Union=None, duration_timeout: typing.Union=None)[source]¶ Download content.
Parameters: - file – An optional file object for the document contents.
- duration_timeout – Maximum time in seconds of which the entire file must be read.
Returns: An instance of
http.request.Response.
See WebClient.session() for proper usage of this function. Coroutine.
-
loop_type() → wpull.protocol.http.web.LoopType[source]¶ Return the type of response.
Seealso: LoopType.
-
redirect_tracker¶ Return the Redirect Tracker.
-
proxy Module¶
proxy.client Module¶
Proxy support for HTTP requests.
-
class
wpull.proxy.client.HTTPProxyConnectionPool(proxy_address, *args, proxy_ssl=False, authentication=None, ssl_context=True, host_filter=None, **kwargs)[source]¶ Bases:
wpull.network.pool.ConnectionPoolEstablish pooled connections to a HTTP proxy.
Parameters: - proxy_address (tuple) – Tuple containing host and port of the proxy server.
- connection_pool (
connection.ConnectionPool) – Connection pool - proxy_ssl (bool) – Whether to connect to the proxy using HTTPS.
- authentication (tuple) – Tuple containing username and password.
- ssl_context – SSL context for SSL connections on TCP tunnels.
- host_filter (
proxy.hostfilter.HostFilter) – Host filter for deciding whether a connection is routed through the proxy. A test result that returns True is routed through the proxy.
proxy.hostfilter Module¶
Host filtering.
proxy.server Module¶
Proxy Tools
-
class
wpull.proxy.server.HTTPProxyServer(http_client: wpull.protocol.http.client.Client)[source]¶ Bases:
wpull.application.hook.HookableMixinHTTP proxy server for use with man-in-the-middle recording.
This function is meant to be used as a callback:
asyncio.start_server(HTTPProxyServer(HTTPClient))
Parameters: http_client ( http.client.Client) – The HTTP client.-
request_callback¶ A callback function that accepts a Request.
-
pre_response_callback¶ A callback function that accepts a Request and Response
-
response_callback¶ A callback function that accepts a Request and Response
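The following is a minimal sketch of wiring the proxy server into an asyncio event loop, based on the callback usage shown above. The default-constructed Client, the host, and the port are illustrative assumptions rather than values required by Wpull.

import asyncio

from wpull.protocol.http.client import Client
from wpull.proxy.server import HTTPProxyServer


@asyncio.coroutine
def start_recording_proxy():
    # Assumption: a default-constructed Client performs the actual fetches.
    http_client = Client()
    proxy = HTTPProxyServer(http_client)

    # The HTTPProxyServer instance itself is the asyncio connection callback.
    server = yield from asyncio.start_server(proxy, 'localhost', 8888)
    return server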
-
regexstream Module¶
Regular expression streams.
-
class
wpull.regexstream.RegexStream(file, pattern, read_size=16384, overlap_size=4096)[source]¶ Bases:
object Streams a file with regular expressions.
Parameters: - file – File object.
- pattern – A compiled regular expression object.
- read_size (int) – The size of a chunk of text that is searched.
- overlap_size (int) – The amount of overlap between chunks of text that is searched.
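A short sketch of constructing the stream over a local file; the filename and pattern are illustrative, and the iteration interface is not shown because it is not documented above.

import re

from wpull.regexstream import RegexStream

# Illustrative pattern: absolute HTTP(S) URLs in a byte stream.
url_pattern = re.compile(br'https?://[^\s<>"\']+')

with open('page.html', 'rb') as file:
    # A smaller read_size lowers memory use; overlap_size prevents matches
    # that straddle a chunk boundary from being missed.
    stream = RegexStream(file, url_pattern, read_size=8192, overlap_size=2048)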
resmon Module¶
Resource monitor.
-
wpull.resmon.ResourceInfo¶ Resource level information
-
wpull.resmon.path¶ str, None
File path of the resource.
None is provided for memory usage.
-
wpull.resmon.free¶ int
Number of bytes available.
-
wpull.resmon.limit¶ int
Minimum bytes of the resource.
alias of
ResourceInfoType-
-
class
wpull.resmon.ResourceMonitor(resource_paths=('/', ), min_disk=10000, min_memory=10000)[source]¶ Bases:
object Monitor available resources such as disk space and memory.
Parameters: - resource_paths (list) – List of paths to monitor. Recommended paths include temporary directories and the current working directory.
- min_disk (int, optional) – Minimum disk space in bytes.
- min_memory (int, optional) – Minimum memory in bytes.
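A minimal construction sketch; the paths and thresholds are illustrative, and psutil must be installed for monitoring to work.

from wpull.resmon import ResourceMonitor

# Watch the working directory and the system temporary directory.
# Thresholds are illustrative: roughly 100 MB of disk and 10 MB of memory.
monitor = ResourceMonitor(
    resource_paths=('.', '/tmp'),
    min_disk=100 * 1000 * 1000,
    min_memory=10 * 1000 * 1000,
)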
robotstxt Module¶
Robots.txt exclusion directives.
scraper Module¶
Document scrapers.
scraper.base Module¶
Base classes
-
class
wpull.scraper.base.BaseExtractiveScraper[source]¶ Bases:
wpull.scraper.base.BaseScraper,wpull.document.base.BaseExtractiveReader
-
class
wpull.scraper.base.BaseHTMLScraper[source]¶ Bases:
wpull.scraper.base.BaseScraper,wpull.document.base.BaseHTMLReader
-
class
wpull.scraper.base.BaseScraper[source]¶ Bases:
objectBase class for scrapers.
-
scrape(request, response, link_type=None)[source]¶ Extract the URLs from the document.
Parameters: - request (
http.request.Request) – The request. - response (
http.request.Response) – The response. - link_type – A value from
item.LinkType.
Returns: LinkContexts and document information.
If None, then the scraper does not support scraping the document.
Return type: ScrapeResult, None
- request (
-
-
class
wpull.scraper.base.BaseTextStreamScraper[source]¶ Bases:
wpull.scraper.base.BaseScraper, wpull.document.base.BaseTextStreamReader Base class for scrapers that process both link and non-link text.
-
iter_processed_links(file, encoding=None, base_url=None, context=False)[source]¶ Return the links.
This function is a convenience function for calling
iter_processed_text() and returning only the links.
-
iter_processed_text(file, encoding=None, base_url=None)[source]¶ Return the file text and processed absolute links.
Parameters: - file – A file object containing the document.
- encoding (str) – The encoding of the document.
- base_url (str) – The URL at which the document is located.
Returns: Each item is a tuple:
- str: The text
- bool: Whether the text is a link
Return type: iterator
-
-
class
wpull.scraper.base.DemuxDocumentScraper(document_scrapers)[source]¶ Bases:
wpull.scraper.base.BaseScraper Puts multiple Document Scrapers into one.
-
scrape(request, response, link_type=None)[source]¶ Iterate the scrapers, returning the first of the results.
-
scrape_info(request, response, link_type=None)[source]¶ Iterate the scrapers and return a dict of results.
Returns: A dict where the keys are the scraper instances and the values are the results. That is, a mapping from BaseDocumentScraper to ScrapeResult. Return type: dict
-
-
wpull.scraper.base.LinkContext¶ A named tuple describing a scraped link.
-
wpull.scraper.base.link¶ str
The link that was scraped.
-
wpull.scraper.base.inline¶ bool
Whether the link is an embedded object.
-
wpull.scraper.base.linked¶ bool
Whether the link links to another page.
-
wpull.scraper.base.link_type¶ A value from
item.LinkType.
-
wpull.scraper.base.extra¶ Any extra info.
alias of
LinkContextType-
-
class
wpull.scraper.base.ScrapeResult(link_contexts, encoding)[source]¶ Bases:
dictLinks scraped from a document.
This class is subclassed from
dict and contains convenience methods.
encoding¶ Character encoding of the document.
-
inline¶ Link Context of objects embedded in the document.
-
inline_links¶ URLs of objects embedded in the document.
-
link_contexts¶ Link Contexts.
-
linked¶ Link Context of objects linked from the document
-
linked_links¶ URLs of objects linked from the document
-
scraper.css Module¶
Stylesheet scraper.
-
class
wpull.scraper.css.CSSScraper(encoding_override=None)[source]¶ Bases:
wpull.document.css.CSSReader,wpull.scraper.base.BaseTextStreamScraperScrapes CSS stylesheet documents.
scraper.html Module¶
HTML link extractor.
-
class
wpull.scraper.html.ElementWalker(css_scraper=None, javascript_scraper=None)[source]¶ Bases:
object-
ATTR_HTML= 2¶ Flag for links that point to other documents.
-
ATTR_INLINE= 1¶ Flag for embedded objects (like images, stylesheets) in documents.
-
DYNAMIC_ATTRIBUTES= ('onkey', 'oncli', 'onmou')¶ Attributes that contain JavaScript.
-
LINK_ATTRIBUTES= frozenset({'usemap', 'data', 'href', 'profile', 'action', 'dynsrc', 'classid', 'codebase', 'cite', 'longdesc', 'lowsrc', 'archive', 'background', 'src'})¶ HTML element attributes that may contain links.
-
OPEN_GRAPH_LINK_NAMES= ('og:url', 'twitter:player')¶ Iterate elements looking for links.
Parameters: - css_scraper (scraper.css.CSSScraper) – Optional CSS scraper.
- javascript_scraper (scraper.javascript.JavaScriptScraper) – Optional JavaScript scraper.
-
OPEN_GRAPH_MEDIA_NAMES= ('og:image', 'og:audio', 'og:video', 'twitter:image:src', 'twitter:image0', 'twitter:image1', 'twitter:image2', 'twitter:image3', 'twitter:player:stream')¶
-
TAG_ATTRIBUTES= {'bgsound': {'src': 1}, 'body': {'background': 1}, 'input': {'src': 1}, 'area': {'href': 2}, 'iframe': {'src': 3}, 'applet': {'code': 1}, 'script': {'src': 1}, 'embed': {'href': 2, 'src': 3}, 'overlay': {'src': 3}, 'a': {'href': 2}, 'object': {'data': 1}, 'form': {'action': 2}, 'table': {'background': 1}, 'th': {'background': 1}, 'td': {'background': 1}, 'layer': {'src': 3}, 'fig': {'src': 1}, 'frame': {'src': 3}, 'img': {'href': 1, 'lowsrc': 1, 'src': 1}}¶ Mapping of element tag names to attributes containing links.
-
classmethod
is_html_link(tag, attribute)[source]¶ Return whether the link is likely to be an external object.
-
classmethod
is_link_inline(tag, attribute)[source]¶ Return whether the link is likely to be an inline object.
-
iter_links(elements)[source]¶ Iterate the document root for links.
Returns: An iterator of LinkInfo. Return type: iterable
-
iter_links_by_js_attrib(attrib_name, attrib_value)[source]¶ Iterate links of a JavaScript pseudo-link attribute.
-
iter_links_link_element(element)[source]¶ Iterate a
linkfor URLs.This function handles stylesheets and icons in addition to standard scraping rules.
-
classmethod
iter_links_meta_element(element)[source]¶ Iterate the
metaelement for links.This function handles refresh URLs.
-
-
class
wpull.scraper.html.HTMLScraper(html_parser, element_walker, followed_tags=None, ignored_tags=None, robots=False, only_relative=False, encoding_override=None)[source]¶ Bases:
wpull.document.html.HTMLReader, wpull.scraper.base.BaseHTMLScraper Scraper for HTML documents.
Parameters: - html_parser (document.htmlparse.base.BaseParser) – An HTML parser such as the lxml or html5lib one.
- element_walker (ElementWalker) – HTML element walker.
- followed_tags – A list of tags that should be scraped
- ignored_tags – A list of tags that should not be scraped
- robots – If True, discard any links if they cannot be followed
- only_relative – If True, discard any links that are not absolute paths
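A sketch of assembling the scraper for a single fetched document. The html5lib parser import path is an assumption that may differ between releases, and request and response stand in for a completed http.request.Request and Response pair.

from wpull.document.htmlparse.html5lib_ import HTMLParser
from wpull.scraper.html import ElementWalker, HTMLScraper


def print_scraped_links(request, response):
    # Assumed import path for the html5lib-backed parser.
    scraper = HTMLScraper(HTMLParser(), ElementWalker())

    # scrape() returns a ScrapeResult, or None if the document is unsupported.
    result = scraper.scrape(request, response)

    if result:
        for url in result.inline_links:
            print('page requisite:', url)
        for url in result.linked_links:
            print('linked page:', url)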
-
wpull.scraper.html.LinkInfo¶ Information about a link in a lxml document.
-
wpull.scraper.html.element¶ An instance of
document.HTMLReadElement.
-
wpull.scraper.html.tag¶ str
The element tag name.
-
wpull.scraper.html.attrib¶ str, None
If
str, the name of the attribute. Otherwise, the link was found inelement.text.
-
wpull.scraper.html.link¶ str
The link found.
-
wpull.scraper.html.inline¶ bool
Whether the link is an embedded object (like images or stylesheets).
-
wpull.scraper.html.linked¶ bool
Whether the link is a link to another page.
-
wpull.scraper.html.base_link¶ str, None
The base URL.
-
wpull.scraper.html.value_type¶ str
Indicates how the link was found. Possible values are:
- plain: The link was found plainly in an attribute value.
- list: The link was found in a space separated list.
- css: The link was found in CSS text.
- refresh: The link was found in a refresh meta string.
- script: The link was found in JavaScript text.
- srcset: The link was found in a srcset attribute.
-
wpull.scraper.html.link_type¶ A value from
item.LinkType.
alias of
LinkInfoType-
scraper.javascript Module¶
JavaScript scraper.
-
class
wpull.scraper.javascript.JavaScriptScraper(encoding_override=None)[source]¶ Bases:
wpull.document.javascript.JavaScriptReader,wpull.scraper.base.BaseTextStreamScraperScrapes JavaScript documents.
scraper.sitemap Module¶
Sitemap scraper
-
class
wpull.scraper.sitemap.SitemapScraper(html_parser, encoding_override=None)[source]¶ Bases:
wpull.document.sitemap.SitemapReader,wpull.scraper.base.BaseExtractiveScraperScrape Sitemaps
scraper.util Module¶
Misc functions.
-
wpull.scraper.util.clean_link_soup(link)[source]¶ Strip whitespace from a link in HTML soup.
Parameters: link (str) – A string containing the link with lots of whitespace. The link is split into lines. For each line, leading and trailing whitespace is removed and tabs are removed throughout. The lines are concatenated and returned.
For example, passing the
href value of:
<a href=" http://example.com/ blog/entry/ how smaug stole all the bitcoins.html ">
will return
http://example.com/blog/entry/how smaug stole all the bitcoins.html.
Returns: The cleaned link. Return type: str
-
wpull.scraper.util.identify_link_type(filename)[source]¶ Return link type guessed by filename extension.
Returns: A value from item.LinkType.Return type: str
-
wpull.scraper.util.is_likely_link(text)[source]¶ Return whether the text is likely to be a link.
This function assumes that leading/trailing whitespace has already been removed.
Returns: bool
stats Module¶
Statistics.
-
class
wpull.stats.Statistics(url_table: typing.Union=None)[source]¶ Bases:
objectStatistics.
-
start_time¶ float
Timestamp when the engine started.
-
stop_time¶ float
Timestamp when the engine stopped.
-
files¶ int
Number of files downloaded.
-
size¶ int
Size of files in bytes.
-
errors¶ a Counter mapping error types to integer.
-
quota¶ int
Threshold of number of bytes when the download quota is exceeded.
-
bandwidth_meter¶ network.BandwidthMeterThe bandwidth meter.
-
duration¶ Return the duration of the interval in seconds.
-
increment(size: int)[source]¶ Increment the number of files downloaded.
Parameters: size – The size of the file
-
is_quota_exceeded¶ Return whether the quota is exceeded.
-
string Module¶
String and binary data functions.
-
wpull.string.coerce_str_to_ascii(string)[source]¶ Force the contents of the string to be ASCII.
Anything not ASCII will be replaced with a replacement character.
Deprecated since version 0.1002: Use
printable_str() instead.
-
wpull.string.detect_encoding(data, encoding=None, fallback='latin1', is_html=False)[source]¶ Detect the character encoding of the data.
Returns: The name of the codec
Return type: str
Raises: ValueError – The codec could not be detected. This error can only occur if fallback is not a “lossless” codec.
-
wpull.string.format_size(num, format_str='{num:.1f} {unit}')[source]¶ Format the file size into a human readable text.
-
wpull.string.normalize_codec_name(name)[source]¶ Return the Python name of the encoder/decoder
Returns: str, None
-
wpull.string.printable_bytes(data)[source]¶ Remove any bytes that are not printable ASCII.
This function is intended for sniffing content types such as UTF-16 encoded text.
-
wpull.string.printable_str(text, keep_newlines=False)[source]¶ Escape any control or non-ASCII characters from string.
This function is intended for use with strings from an untrusted source such as writing to a console or writing to logs. It is designed to prevent things like ANSI escape sequences from showing.
Use
repr() or ascii() instead for things such as Exception messages.
url Module¶
URL parsing based on WHATWG URL living standard.
-
wpull.url.C0_CONTROL_SET= frozenset({'\x02', '\x17', '\x1c', '\x01', '\x1f', '\x1a', '\x04', '\x15', '\x1b', '\x1e', '\x1d', '\x07', '\r', '\x05', '\x10', '\x0e', '\x16', '\x18', '\x00', '\n', '\x14', '\x06', '\x12', '\x0c', '\x13', '\x19', '\x08', '\x03', '\x0f', '\t', '\x0b', '\x11'})¶ Characters from 0x00 to 0x1f inclusive
-
wpull.url.DEFAULT_ENCODE_SET= frozenset({32, 96, 34, 35, 60, 62, 63})¶ Percent encoding set as defined by WHATWG URL living standard.
Does not include U+0000 to U+001F nor U+007F or above.
-
wpull.url.FORBIDDEN_HOSTNAME_CHARS= frozenset({'\\', ':', '#', ' ', '%', ']', '?', '@', '/', '['})¶ Forbidden hostname characters.
Does not include non-printing characters. Meant for ASCII.
-
wpull.url.FRAGMENT_ENCODE_SET= frozenset({32, 96, 34, 60, 62})¶ Encoding set for fragment.
-
wpull.url.PASSWORD_ENCODE_SET= frozenset({32, 96, 34, 35, 64, 47, 60, 92, 62, 63})¶ Encoding set for passwords.
-
class
wpull.url.PercentEncoderMap(encode_set)[source]¶ Bases:
collections.defaultdictHelper map for percent encoding.
-
wpull.url.QUERY_ENCODE_SET= frozenset({96, 34, 35, 60, 62})¶ Encoding set for query strings.
This set does not include U+0020 (space) so it can be replaced with U+002B (plus sign) later.
-
wpull.url.QUERY_VALUE_ENCODE_SET= frozenset({96, 34, 35, 37, 38, 43, 60, 62})¶ Encoding set for a query value.
-
class
wpull.url.URLInfo[source]¶ Bases:
objectRepresent parts of a URL.
-
raw¶ str
Original string.
-
scheme¶ str
Protocol (for example, HTTP, FTP).
-
authority¶ str
Raw userinfo and host.
-
path¶ str
Location of resource. This value always begins with a slash (
/).
-
query¶ str
Additional request parameters.
-
fragment¶ str
Named anchor of a document.
-
userinfo¶ str
Raw username and password.
-
username¶ str
Username.
-
password¶ str
Password.
-
host¶ str
Raw hostname and port.
-
hostname¶ str
Hostname or IP address.
-
port¶ int
IP address port number.
-
resource¶ str
Raw path, query, and fragment. This value always begins with a slash (
/).
-
query_map¶ dict
Mapping of the query. Values are lists.
-
url¶ str
A normalized URL without userinfo and fragment.
-
encoding¶ str
Codec name for IRI support.
If scheme is not something like HTTP or FTP, the remaining attributes are None.
All attributes are read only.
For more information about how the URL parts are derived, see https://medialize.github.io/URI.js/about-uris.html
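A brief sketch of parsing a URL and reading the documented attributes; the URL is illustrative.

from wpull.url import URLInfo

info = URLInfo.parse('http://Example.COM/a/../blog/index.html?tag=news#top')

print(info.scheme, info.hostname, info.port)
print(info.path)       # always begins with a slash
print(info.query_map)  # mapping of query keys to lists of values
print(info.url)        # normalized URL without userinfo and fragment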
-
authority
-
encoding
-
fragment
-
host
-
hostname
-
hostname_with_port¶ Return the host portion but omit default port if needed.
-
classmethod
parse(url, default_scheme='http', encoding='utf-8')[source]¶ Parse a URL and return a URLInfo.
Parse the authority part and return userinfo and host.
-
password
-
path
-
port
-
query
-
query_map
-
raw
-
resource
-
scheme
-
split_path()[source]¶ Return the directory and filename from the path.
The results are not percent-decoded.
-
url
-
userinfo
-
username
-
-
wpull.url.USERNAME_ENCODE_SET= frozenset({32, 96, 34, 35, 64, 47, 58, 60, 92, 62, 63})¶ Encoding set for usernames.
-
wpull.url.flatten_path(path, flatten_slashes=False)[source]¶ Flatten an absolute URL path by removing the dot segments.
urllib.parse.urljoin() has some support for removing dot segments, but it is conservative and only removes them as needed. Parameters: - path (str) – The URL path.
- flatten_slashes (bool) – If True, consecutive slashes are removed.
The path returned will always have a leading slash.
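Two illustrative calls showing the documented behavior; the expected results in the comments follow from the description above.

from wpull.url import flatten_path

# Dot segments are removed even where urljoin() would leave them in place.
print(flatten_path('/a/b/../c/./d.html'))
# expected: '/a/c/d.html'

# Consecutive slashes are collapsed only when requested.
print(flatten_path('/a//b///c.html', flatten_slashes=True))
# expected: '/a/b/c.html'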
-
wpull.url.is_subdir(base_path, test_path, trailing_slash=False, wildcards=False)[source]¶ Return whether a path is a subpath of another.
Parameters: - base_path – The base path
- test_path – The path which we are testing
- trailing_slash – If True, the trailing slash is treated with importance.
For example,
/images/is a directory while/imagesis a file. - wildcards – If True, globbing wildcards are matched against paths
-
wpull.url.normalize(url, **kwargs)[source]¶ Normalize a URL.
This function is a convenience function that is equivalent to:
>>> URLInfo.parse('http://example.com').url
'http://example.com'
Seealso: URLInfo.parse().
-
wpull.url.normalize_fragment(text, encoding='utf-8')[source]¶ Normalize a fragment.
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.normalize_hostname(hostname)[source]¶ Normalizes a hostname so that it is ASCII and a valid domain name.
-
wpull.url.normalize_password(text, encoding='utf-8')[source]¶ Normalize a password
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.normalize_path(path, encoding='utf-8')[source]¶ Normalize a path string.
Flattens a path by removing dot parts, percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.normalize_query(text, encoding='utf-8')[source]¶ Normalize a query string.
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.normalize_username(text, encoding='utf-8')[source]¶ Normalize a username
Percent-encodes unacceptable characters and ensures percent-encoding is uppercase.
-
wpull.url.parse_url_or_log(url, encoding='utf-8')[source]¶ Parse and return a URLInfo.
This function logs a warning if the URL cannot be parsed and returns None.
-
wpull.url.percent_encode(text, encode_set=frozenset({32, 96, 34, 35, 60, 62, 63}), encoding='utf-8')[source]¶ Percent encode text.
Unlike Python’s
quote, this function accepts a blacklist instead of a whitelist of safe characters.
-
wpull.url.percent_encode_plus(text, encode_set=frozenset({96, 34, 35, 60, 62}), encoding='utf-8')[source]¶ Percent encode text for query strings.
Unlike Python’s
quote_plus, this function accepts a blacklist instead of a whitelist of safe characters.
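A small sketch contrasting the two helpers; the input strings are illustrative and the exact escaping follows the encode sets documented above.

from wpull.url import percent_encode, percent_encode_plus

# Only characters in the (blacklist) encode set are escaped; '/' is left alone.
print(percent_encode('hello world <tag>?'))

# The query-string variant uses the query encode set documented above.
print(percent_encode_plus('tag=hello world&x=1'))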
-
wpull.url.query_to_map(text)[source]¶ Return a key-values mapping from a query string.
Plus symbols are replaced with spaces.
-
wpull.url.schemes_similar(scheme1, scheme2)[source]¶ Return whether URL schemes are similar.
This function considers the following schemes to be similar:
- HTTP and HTTPS
-
wpull.url.split_query(qs, keep_blank_values=False)[source]¶ Split the query string.
Note for empty values: If an equal sign (
=) is present, the value will be an empty string (''). Otherwise, the value will be None:
>>> list(split_query('a=&b', keep_blank_values=True))
[('a', ''), ('b', None)]
No processing is done on the actual values.
urlfilter Module¶
URL filters.
-
class
wpull.urlfilter.BackwardDomainFilter(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterReturn whether the hostname matches a list of hostname suffixes.
-
class
wpull.urlfilter.BackwardFilenameFilter(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterFilter URLs that match the filename suffixes.
-
class
wpull.urlfilter.BaseURLFilter[source]¶ Bases:
objectBase class for URL filters.
The Processor uses filters to determine whether a URL should be downloaded.
-
class
wpull.urlfilter.DemuxURLFilter(url_filters: typing.Iterator)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterPuts multiple url filters into one.
-
test_info(url_info, url_table_record) → dict[source]¶ Returns info about which filters passed or failed.
Returns: A dict containing the keys:
- verdict (bool): Whether all the tests passed.
- passed (set): A set of URLFilters that passed.
- failed (set): A set of URLFilters that failed.
- map (dict): A mapping from URLFilter class name (str) to the verdict (bool).
Return type: dict
-
url_filters¶
-
-
class
wpull.urlfilter.DirectoryFilter(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterFilter URLs that match a directory path part.
-
class
wpull.urlfilter.FollowFTPFilter(follow=False)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterFollow links to FTP URLs.
-
class
wpull.urlfilter.HTTPSOnlyFilter[source]¶ Bases:
wpull.urlfilter.BaseURLFilterAllow URL if the URL is HTTPS.
-
class
wpull.urlfilter.HostnameFilter(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterReturn whether the hostname matches exactly in a list.
-
class
wpull.urlfilter.LevelFilter(max_depth, inline_max_depth=5)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterAllow URLs up to a level of recursion.
-
class
wpull.urlfilter.ParentFilter[source]¶ Bases:
wpull.urlfilter.BaseURLFilterFilter URLs that descend up parent paths.
-
class
wpull.urlfilter.RecursiveFilter(enabled=False, page_requisites=False)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterReturn
True if recursion is used.
-
class
wpull.urlfilter.RegexFilter(accepted=None, rejected=None)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterFilter URLs that match a regular expression.
-
class
wpull.urlfilter.SchemeFilter(allowed=('http', 'https', 'ftp'))[source]¶ Bases:
wpull.urlfilter.BaseURLFilterAllow URL if the URL is in list.
-
class
wpull.urlfilter.SpanHostsFilter(hostnames, enabled=False, page_requisites=False, linked_pages=False)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterFilter URLs that go to other hostnames.
-
class
wpull.urlfilter.TriesFilter(max_tries)[source]¶ Bases:
wpull.urlfilter.BaseURLFilterAllow URLs that have been attempted up to a limit of tries.
urlrewrite Module¶
URL rewriting.
util Module¶
Miscellaneous functions.
-
class
wpull.util.ASCIIStreamWriter(stream, errors='backslashreplace')[source]¶ Bases:
codecs.StreamWriterA Stream Writer that encodes everything to ASCII.
By default, the replacement character is a Python backslash sequence.
-
DEFAULT_ERROR= 'backslashreplace'¶
-
-
class
wpull.util.GzipPickleStream(filename=None, file=None, mode='rb', **kwargs)[source]¶ Bases:
wpull.util.PickleStreamgzip compressed pickle stream.
-
class
wpull.util.PickleStream(filename=None, file=None, mode='rb', protocol=3)[source]¶ Bases:
objectPickle stream helper.
-
wpull.util.filter_pem(data)[source]¶ Processes the bytes for PEM certificates.
Returns: set containing each certificate
-
wpull.util.get_exception_message(instance)[source]¶ Try to get the exception message or the class name.
-
wpull.util.get_package_data(filename, mode='rb')[source]¶ Return the contents of a real file or a zip file.
-
wpull.util.get_package_filename(filename, package_dir=None)[source]¶ Return the filename of the data file.
-
wpull.util.grouper(iterable, n, fillvalue=None)[source]¶ Collect data into fixed-length chunks or blocks
-
wpull.util.parse_iso8601_str(string)[source]¶ Parse a fixed ISO8601 datetime string.
Note
This function only parses dates in the format
%Y-%m-%dT%H:%M:%SZ. You must use a library like dateutil to properly parse dates and times. Returns: A UNIX timestamp. Return type: float
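A quick illustration of the accepted format:

from wpull.util import parse_iso8601_str

# Only the fixed '%Y-%m-%dT%H:%M:%SZ' form is accepted.
timestamp = parse_iso8601_str('2016-06-21T12:00:00Z')
print(timestamp)  # a float UNIX timestamp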
version Module¶
Version information.
-
wpull.version.__version__¶ A string conforming to Semantic Versioning Guidelines
-
wpull.version.version_info¶ A tuple in the same format of
sys.version_info
waiter Module¶
Delays between requests.
-
class
wpull.waiter.LinearWaiter(wait=0.0, random_wait=False, max_wait=10.0)[source]¶ Bases:
wpull.waiter.WaiterA linear back-off waiter.
Parameters: - wait – The normal delay time
- random_wait – If True, randomly perturb the delay time within a factor of 0.5 and 1.5
- max_wait – The maximum delay time
This waiter will increment by values of 1 second.
warc Module¶
warc.format Module¶
WARC format.
For the WARC file specification, see http://bibnum.bnf.fr/WARC/WARC_ISO_28500_version1_latestdraft.pdf.
For the CDX specifications, see https://archive.org/web/researcher/cdx_file_format.php and https://github.com/internetarchive/CDX-Writer.
-
class
wpull.warc.format.WARCRecord[source]¶ Bases:
objectA record in a WARC file.
-
fields¶ An instance of
namevalue.NameValueRecord.
-
block_file¶ A file object. May be None.
-
CONTENT_TYPE= 'Content-Type'¶
-
NAME_OVERRIDES= frozenset({'Content-Length', 'WARC-Date', 'Content-Type', 'WARC-Warcinfo-ID', 'WARC-Segment-Origin-ID', 'WARC-Segment-Number', 'WARC-Block-Digest', 'WARC-Identified-Payload-Type', 'WARC-Refers-To', 'WARC-Target-URI', 'WARC-Type', 'WARC-Profile', 'WARC-Segment-Total-Length', 'WARC-Payload-Digest', 'WARC-Truncated', 'WARC-Record-ID', 'WARC-Concurrent-To', 'WARC-IP-Address', 'WARC-Filename'})¶ Field name case normalization overrides because hanzo’s warc-tools do not adequately conform to specifications.
-
REQUEST= 'request'¶
-
RESPONSE= 'response'¶
-
REVISIT= 'revisit'¶
-
SAME_PAYLOAD_DIGEST_URI= 'http://netpreserve.org/warc/1.0/revisit/identical-payload-digest'¶
-
TYPE_REQUEST= 'application/http;msgtype=request'¶
-
TYPE_RESPONSE= 'application/http;msgtype=response'¶
-
VERSION= 'WARC/1.0'¶
-
WARCINFO= 'warcinfo'¶
-
WARC_DATE= 'WARC-Date'¶
-
WARC_FIELDS= 'application/warc-fields'¶
-
WARC_RECORD_ID= 'WARC-Record-ID'¶
-
WARC_TYPE= 'WARC-Type'¶
-
compute_checksum(payload_offset: typing.Union=None)[source]¶ Compute and add the checksum data to the record fields.
This function also sets the content length.
-
get_http_header() → wpull.protocol.http.request.Response[source]¶ Return the HTTP header.
It only attempts to read the first 4 KiB of the payload.
Returns: An instance of http.request.Response or None. Return type: Response, None
-
set_common_fields(warc_type: str, content_type: str)[source]¶ Set the required fields for the record.
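A sketch of building a single record by hand from the attributes and methods documented above. The target URI and payload are illustrative, and assigning to fields assumes the NameValueRecord mapping interface.

import io

from wpull.warc.format import WARCRecord

record = WARCRecord()
record.set_common_fields(WARCRecord.RESPONSE, WARCRecord.TYPE_RESPONSE)

# Assumption: fields supports item assignment like a mapping.
record.fields['WARC-Target-URI'] = 'http://example.com/'

# The payload is stored as a file object.
record.block_file = io.BytesIO(b'HTTP/1.1 200 OK\r\n\r\nhello')

# Adds the digest fields and sets the Content-Length.
record.compute_checksum()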
-
warc.recorder Module¶
-
class
wpull.warc.recorder.BaseWARCRecorderSession(recorder, temp_dir=None, url_table=None)[source]¶ Bases:
objectBase WARC recorder session.
-
class
wpull.warc.recorder.FTPWARCRecorderSession(*args, **kwargs)[source]¶ Bases:
wpull.warc.recorder.BaseWARCRecorderSessionFTP WARC Recorder Session.
-
class
wpull.warc.recorder.HTTPWARCRecorderSession(*args, **kwargs)[source]¶ Bases:
wpull.warc.recorder.BaseWARCRecorderSessionHTTP WARC Recorder Session.
-
class
wpull.warc.recorder.WARCRecorder(filename, params=None)[source]¶ Bases:
objectRecord to WARC file.
Parameters: - filename (str) – The filename (without the extension).
- params (
WARCRecorderParams) – Parameters.
-
CDX_DELIMINATOR= ' '¶ Default CDX delimiter.
-
DEFAULT_SOFTWARE_STRING= 'Wpull/2.0.1 Python/3.4.3'¶ Default software string.
-
classmethod
parse_mimetype(value)[source]¶ Return the MIME type from a Content-Type string.
Returns: A string in the form type/subtype or None. Return type: str, None
-
wpull.warc.recorder.WARCRecorderParams¶ WARCRecorder parameters. Parameters: - compress (bool) – If True, files will be compressed with gzip
- extra_fields (list) – A list of key-value pairs containing extra metadata fields
- temp_dir (str) – Directory to use for temporary files
- log (bool) – Include the program logging messages in the WARC file
- appending (bool) – If True, the file is not overwritten upon opening
- digests (bool) – If True, the SHA1 hash digests will be written.
- cdx (bool) – If True, a CDX file will be written.
- max_size (int) – If provided, output files are named like
name-00000.extand the log file will be inname-meta.ext. - move_to (str) – If provided, completed WARC files and CDX files will be moved to the given directory
- url_table (
database.URLTable) – If given, then revisit records will be written. - software_string (str) – The value for the
softwarefield in the Warcinfo record.
alias of
WARCRecorderParamsType
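A minimal sketch of creating a recorder with a parameter object; it assumes WARCRecorderParams accepts keyword arguments and supplies defaults for any fields left unspecified.

from wpull.warc.recorder import WARCRecorder, WARCRecorderParams

# Assumption: unspecified parameters fall back to their defaults.
params = WARCRecorderParams(
    compress=True,
    extra_fields=[('operator', 'Example Crawl')],
    cdx=True,
)

# Output files use 'example' as the base filename (extension is added).
recorder = WARCRecorder('example', params=params)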
writer Module¶
Document writers.
-
class
wpull.writer.AntiClobberFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseFileWriterFile writer that downloads to a new filename if the original exists.
-
session_class¶
-
-
class
wpull.writer.AntiClobberFileWriterSession(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]¶
-
class
wpull.writer.BaseFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseWriter Base class for saving documents to disk.
Parameters: - path_namer – The path namer.
- file_continuing – If True, the writer will modify requests to fetch the remaining portion of the file
- headers_included – If True, the writer will include the HTTP header responses on top of the document
- local_timestamping – If True, the writer will set the Last-Modified timestamp on downloaded files
- adjust_extension – If True, an HTML or CSS file extension will be added whenever the document is detected as such.
- content_disposition – If True, the filename is extracted from the Content-Disposition header.
- trust_server_names – If True and there is redirection, use the last given response for the filename.
-
session_class¶ Return the class of File Writer Session.
This should be overridden by subclasses.
-
class
wpull.writer.BaseFileWriterSession(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]¶ Bases:
wpull.writer.BaseWriterSession Base class for File Writer Sessions.
-
classmethod
open_file(filename: str, response: wpull.protocol.abstract.request.BaseResponse, mode='wb+')[source]¶ Open a file object onto the Response Body.
Parameters: - filename – The path where the file is to be saved
- response – Response
- mode – The file mode
This function will create the directories if they do not exist.
-
-
class
wpull.writer.BaseWriterSession[source]¶ Bases:
objectBase class for a single document to be written.
-
discard_document(response: wpull.protocol.abstract.request.BaseResponse)[source]¶ Don’t save the document.
This function is called by a Processor once the Processor has deemed that the document should be deleted (e.g., a “404 Not Found” response).
-
extra_resource_path(suffix: str) → typing.Union[source]¶ Return a filename suitable for saving extra resources.
-
process_request(request: wpull.protocol.abstract.request.BaseRequest) → wpull.protocol.abstract.request.BaseRequest[source]¶ Rewrite the request if needed.
This function is called by a Processor after it has created the Request, but before submitting it to a Client.
Returns: The original Request or a modified Request
-
-
class
wpull.writer.IgnoreFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseFileWriterFile writer that ignores files that already exist.
-
session_class¶
-
-
class
wpull.writer.IgnoreFileWriterSession(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]¶
-
class
wpull.writer.MuxBody(stream: typing.BinaryIO, **kwargs)[source]¶ Bases:
wpull.body.BodyWrites data into a second file.
-
class
wpull.writer.NullWriter[source]¶ Bases:
wpull.writer.BaseWriterFile writer that doesn’t write files.
-
class
wpull.writer.OverwriteFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseFileWriterFile writer that overwrites files.
-
session_class¶
-
-
class
wpull.writer.OverwriteFileWriterSession(path_namer: wpull.path.PathNamer, file_continuing: bool, headers_included: bool, local_timestamping: bool, adjust_extension: bool, content_disposition: bool, trust_server_names: bool)[source]¶
-
class
wpull.writer.SingleDocumentWriter(stream: typing.BinaryIO, headers_included: bool=False)[source]¶ Bases:
wpull.writer.BaseWriterWriter that writes all the data into a single file.
-
class
wpull.writer.SingleDocumentWriterSession(stream: typing.BinaryIO, headers_included: bool)[source]¶ Bases:
wpull.writer.BaseWriterSessionWrite all data into stream.
-
class
wpull.writer.TimestampingFileWriter(path_namer: wpull.path.PathNamer, file_continuing: bool=False, headers_included: bool=False, local_timestamping: bool=True, adjust_extension: bool=False, content_disposition: bool=False, trust_server_names: bool=False)[source]¶ Bases:
wpull.writer.BaseFileWriterFile writer that only downloads newer files from the server.
-
session_class¶
-
What’s New¶
Summary of notable changes.
Unreleased¶
2.0.1 (2016-06-21)¶
- Fixed: KeyError crash when psutil was not installed.
- Fixed: AttributeError proxy error using PhantomJS due to response body not written to a file.
2.0 (2016-06-17)¶
- Removed: Lua scripting support and its Python counterpart (
--lua-scriptand--python-script). - Removed: Python 3.2 & 3.3 support.
- Removed: PyPy support.
- Changed: IP addresses are normalized to a standard notation to avoid fetching duplicates such as IPv4 addresses written in hexadecimal or long-hand IPv6 addresses.
- Changed: Scripting is now done using plugin interface via
--plugin-script. - Fixed: Support for Python 3.5.
- Fixed: FTP unable to handle directory listing with date in MMM DD YYYY and filename containing YYYY-MM-DD text.
- Fixed: Downloads through the proxy (such as PhantomJS) now show up in the database and can be controlled through scripting.
- Fixed: NotFound error when converting links in CSS file that contain URLs that were not fetched.
- Fixed: When resuming a forcefully interrupted crawl (e.g., a crash) using a database, URLs in progress were not restarted because they were not reset in the database when the program started up.
Backwards incompatibility¶
This release contains backwards incompatible changes to the database schema and scripting interface.
If you use --database, the database created by older versions of
Wpull cannot be used in this version.
Scripting hook code will need to be rewritten to use the new API. See the new documentation for scripting for the new style of interfacing with Wpull.
Additionally for scripts, the internal event loop has switched from Trollius to built-in Asyncio.
1.2.3 (2016-02-03)¶
- Removed: cx_freeze build support.
- Deprecated: Lua Scripting support will be removed in next release.
- Deprecated: Python 3.2 & 3.3 support will be removed in the next release.
- Deprecated: PyPy support will be removed in the next release.
- Fixed: Error when logging in with FTP to servers that don’t need a password.
- Fixed: ValueError when downloading URLs that contain unencoded unprintable characters like Zero Width Non-Joiner or Right to Left Mark.
1.2.2 (2015-10-21)¶
- Fixed:
--output-documentfile doesn’t contain content. - Fixed: OverflowError when URL contains invalid port number greater than 65535 or less than 0.
- Fixed: AssertionError when saving IPv4-mapped IPv6 addresses to WARC files.
- Fixed: AttributeError when running with installed Trollius 2.0.
- Changed: The setup file no longer requires optional psutil.
1.2.1 (2015-05-15)¶
- Fixed: OverflowError with URLs with large port numbers.
- Fixed: TypeError when using standard input as input file (
--input-file -). - Changed: using
--youtube-dl respects --inet4-only and --no-check-certificate now.
1.2 (2015-04-24)¶
- Fixed: Connecting to sites with IPv4 & IPv6 support resulted in errors when IPv6 was not supported by the local network. Connections now use Happy Eyeballs Algorithm for IPv4 & IPv6 dual-stack support.
- Fixed: SQLAlchemy error with PyPy and SQLAlchemy 1.0.
- Fixed: Input URLs are not fetched in order. Regression since 1.1.
- Fixed: UnicodeEncodeError when fetching FTP files with non-ASCII filenames.
- Fixed: Session cookies not loaded when using
--load-cookies. - Fixed:
--keep-session-cookieswas always on. - Changed: FTP communication uses UTF-8 instead of Latin-1.
- Changed:
--prefer-family=none is now default. - Added:
noneas a choice to--prefer-family. - Added:
--no-globand FTP filename glob support.
1.1.1 (2015-04-13)¶
- Changed: when using
--youtube-dl and --warc-file, the JSON metadata file is now saved in the WARC file, compatible with pywb. - Changed: logging and progress meter to say “unspecified” instead of “none” when no content length is provided by the server, to match Wget.
1.1 (2015-04-03)¶
- Security: Updated certificate bundle.
- Fixed:
--regex-type to accept pcre instead of posix. Regular expressions always use Python’s regex library. Posix regex is not supported. - Fixed: when using
--warc-max-sizeand--warc-append, it wrote to existing sequential WARC files unnecessarily. - Fixed: input URLs stored in memory instead of saved on disk. This issue was notable if there were many URLs provided by the
--input-fileoption. - Changed: when using
--warc-max-sizeand--warc-append, the next sequential WARC file is created to avoid appending to corrupt files. - Changed: WARC file writing to use journal files and refuse to start program if any journals exist. This avoids corrupting files through naive use of
--warc-appendand allow for future automated recovery. - Added: Open Graph and Twitter Card element links extraction.
1.0 (2015-03-14)¶
- Fixed: a
--database path with a question mark (?) truncated the path, did not use an on-disk database, or caused a TypeError. The question mark is now automatically replaced with an underscore. - Fixed: HTTP proxy support broken since version 0.1001.
- Added:
no_proxyenvironment variable support. - Added:
--proxy-domains,--proxy-exclude-domains,--proxy-hostnames,--proxy-exclude-hostnames - Removed:
--no-secure-proxy-tunnel.
0.1009 (2015-03-08)¶
- Added:
--preserve-permissions. - Fixed: exit code returned as 2 instead of 1 on generic errors.
- Changed: exception tracebacks are printed only on generic errors.
- Changed: temporary WARC log file is now compressed to save space.
Scripting Hook API:
- Added: Version 3 API
- Added:
wait_time to version 3, which provides useful context including response or error info.
0.1008 (2015-02-26)¶
- Security: updated certificate bundle.
- Fixed: TypeError crash on bad Meta Refresh HTML element.
- Fixed: unable to fetch FTP files with spaces and other special characters.
- Fixed: AssertionError fetching URLs with trailing dot not properly removed.
- Added:
--no-cache. - Added:
--report-speed. - Added:
--monitor-diskand--monitor-memory.
0.1007 (2015-02-19)¶
- Fixed malformed URLs printed to logs without sanitation.
- Fixed AttributeError crash on FTP servers that support MLSD.
- Improved link recursion heuristics when extracting from JavaScript and HTML.
- Added
--retr-symlinks. - Added
--session-timeout.
0.1006.1 (2015-02-09)¶
- Security: Fixed
Referer HTTP header field leaking from HTTPS to HTTP. - Fixed
AttributeError in proxy when using PhantomJS and the pre_response scripting hook. - Fixed early program end when the server returns an error fetching robots.txt.
- Fixed uninteresting errors outputted if program is forcefully closed.
- Fixed
--refereroption not applied to subsequent requests.
0.1006 (2015-02-01)¶
- Fixed inability to fetch URLs with hostnames starting/ending with hyphen.
- Fixed “Invalid file descriptor” error in proxy server.
- Fixed FTP listing dates mistakenly parsed as future date within the same month.
- Added
--escaped-fragmentoption. - Added
--strip-session-idoption. - Added
--no-skip-getaddrinfooption. - Added
--limit-rateoption. - Added
--phantomjs-max-timeoption. - Added
--youtube-dloption. - Added
--plugin-scriptoption. - Improved PhantomJS stability.
0.1005 (2015-01-15)¶
- Security: SSLv2/SSLv3 is disabled for
--secure-protocol=auto. Added--no-strong-cryptothat re-enables them again if needed. - Fixed NameError with PhantomJS proxy on Python 3.2.
- Fixed PhantomJS stop waiting for page load too early.
- Fixed “Line too long” error and remove uninteresting page errors during PhantomJS.
- Fixed
--page-requisitesexceeding--level. - Fixed
--no-verbosenot providing informative messages and behaving like--quiet. - Fixed infinite page requisite recursion when using
--span-hosts-allow page-requisites. - Added
--page-requisites-level. The default max recursion depth on page requisites is now 5. - Added
--very-quiet. --no-verboseis defaulted when--concurrentis 2 or greater.
Database Schema:
- URL
inlinecolumn is now an integer.
0.1004.2 (2015-01-03)¶
Hotfix release.
- Fixed PhantomJS mode’s MITM proxy AttributeError on certificates.
0.1004.1 (2015-01-03)¶
- Fixed TypeError crash on a bad cookie.
- Fixed PhantomJS mode’s MITM proxy SSL certificates not installed.
0.1004 (2014-12-25)¶
- Fixed FTP data connection reuse error.
- Fixed maximum recursion depth exceeded on FTP downloads.
- Fixed FTP file listing detecting dates too eagerly as ISO8601 format.
- Fixed crash on FTP if file listing could not find a date in a line.
- Fixed HTTP status code 204 “No Content” interpreted as an error.
- Fixed “cert already in hash table” error when using both OS and Wpull’s certificates.
- Improved PhantomJS stability. Timeout errors should be less frequent.
- Added
--adjust-extension. - Added
--content-disposition. - Added
--trust-server-names.
0.1003 (2014-12-11)¶
- Fixed FTP fetch where code 125 was not recognized as valid.
- Fixed FTP 12 o’clock AM/PM time logic.
- Fixed URLs fetched as lowercase URLs when scheme and authority separator is not provided.
- Added
--database-urioption to specify a SQLAlchemy URI. - Added
noneas a choice to--progress. - Added
--user/--passwordsupport. - Scripting:
- Fixed missing response callback during redirects. Regression introduced in v0.1002.
0.1002 (2014-11-24)¶
- Fixed control characters printed without escaping.
- Fixed cookie size not limited correctly per domain name.
- Fixed URL parsing incorrectly allowing spaces in hostnames.
- Fixed
--sitemapsoption not respecting--no-parent. - Fixed “Content overrun” error on broken web servers. A warning is logged instead.
- Fixed SSL verification error despite
--no-check-certificateis specified. - Fixed crash on IPv6 URLs containing consecutive dots.
- Fixed crash attempting to connect to IPv6 addresses.
- Consecutive slashes in URL paths are now flattened.
- Fixed crash when fetching IPv6 robots.txt file.
- Added experimental FTP support.
- Switched default HTML parser to html5lib.
- Scripting:
- Added
handle_pre_responsecallback hook.
- Added
- API:
- Fixed
ConnectionPoolmax_host_countargument not used. - Moved document scraping concerns from
WebProcessorSessiontoProcessingRule. - Renamed
SSLVerficationErrortoSSLVerificationError.
- Fixed
0.1001.2 (2014-10-25)¶
- Fixed ValueError crash on HTTP redirects with bad IPv6 URLs.
- Fixed AssertionError on link extraction with non-absolute URLs in “codebase” attribute.
- Fixed premature exit during an error fetching robots.txt.
- Fixed executable filename problem in setup.py for cx_Freeze builds.
0.1001.1 (2014-10-09)¶
- Fixed URLs with IPv6 addresses not including brackets when using them in host strings.
- Fixed AssertionError crash where PhantomJS crashed.
- Fixed database slowness over time.
- Cookies are now synchronized and shared with PhantomJS.
- Scripting:
- Fixed mismatched
queued_url and dequeued_url causing negative values in a counter. The issue was caused by requeued items in “error” status.
- Fixed mismatched
0.1001 (2014-09-16)¶
- Fixed
--warc-moveoption which had no effect. - Fixed JavaScript scraper to not accept URLs with backslashes.
- Fixed CSS scraper to not accept URLs longer than 500 characters.
- Fixed ValueError crash in Cache when two URLs are added sequentially at the same time due to bad LinkedList key comparison.
- Fixed crash formatting text when sizes reach terabytes.
- Fixed hang which may occur with lots of connection across many hostnames.
- Support for HTTP/HTTPS proxies but no HTTPS tunnelling support. Wpull will refuse to start without the insecure override option. Note that if authentication and WARC output are both enabled, the username and password are recorded into the WARC file.
- Improved database performance.
- Added
--ignore-fatal-errorsoption. - Added
--http-parseroption. You can now use html5lib as the HTML parser. - Support for PyPy 2.3.1 running with Python 3.2 implementation.
- Consistent URL parsing among various Python versions.
- Added
--link-extractorsoption. - Added
--debug-manholeoption. - API:
documentandscraperwere put into their own packages.- HTML parsing was put into
document.htmlparsepackage. url.URLInfono longer supports normalizing URLs by percent decoding unreserved/safe characters.
- Scripting:
- Dropped support for Scripting API version 1.
- Database schema:
- Column
url_encodingis removed fromurlstable.
- Column
0.1000 (2014-09-02)¶
- Dropped support for Python 2. Please file an issue if this is a problem.
- Fixed possible crash on empty content with deflate compression.
- Fixed document encoding detection on documents larger than 4096 bytes where an encoded character may have been truncated.
- Always percent-encode IRIs with UTF-8 to match de facto web browser implementation.
- HTTP headers are consistently decoded as Latin-1.
- Scripting API:
- New
queued_urlanddequeued_urlhooks contributed by mback2k.
- New
- API:
- Switched to Trollius instead of Tornado. Please use Trollius 1.0.2 alpha or greater.
- Most of the internals related to the HTTP protocol were rewritten and, as a result, major components are not backwards compatible; lots of changes were made. If you happen to be using Wpull’s API, please pin your requirements to
<0.1000if you do not want to make a migration. Please file an issue if this is a problem.
0.36.4 (2014-08-07)¶
- Fixes crash when
--save-cookiesis used with non-ASCII cookies. Cookies with non-ASCII values are discarded. - Fixed HTTP gzip compressed content not decompressed during chunked transfer of single bytes.
- Tornado 4.0 support.
- API:
- Renamed:
cookie.CookieLimitsPolicytoDeFactoCookiePolicy.
- Renamed:
0.36.3 (2014-07-25)¶
- Improved performance on
--databaseoption. SQLite now uses synchronous=NORMAL instead of FULL.
0.36.2 (2014-07-16)¶
- Fixed requirements.txt to use Tornado version less than 4.0.
0.36.1 (2014-07-16)¶
- Fixes bug where “FINISHED” message was not logged in WARC file meta log. Regression was introduced in version 0.35.
0.36 (2014-06-23)¶
- Works around
PhantomJSRPCTimedOuterrors. - Adds
--phantomjs-exeoption. - Supports extracting links from HTML
imgsrcsetattribute. - API:
Builder.build()returnsApplicationinstead ofEngine.- Callback hooks
exit_statusandfinishing_statisticsnow registered onApplicationinstead ofEngine. networkmodule split into two modulesbandwidthanddns.- Adds
observermodule. phantomjs.PhantomJSRemote.page_eventrenamed topage_observer.
0.35 (2014-06-16)¶
- Adds
--warc-moveoption. - Scripting:
- Default scripting version is now 2.
- API:
- Builder moved into new module builder
- Adds Application class intended for different UI in the future.
Resolverfamiliesparameter renamed intofamily. It accepts values from the modulesocketorPREFER_IPv4/PREFER_IPv6.- Adds
HookableMixin. This removes the use of messy subclassing for scripting hooks.
0.34.1 (2014-05-26)¶
- Fixes crash when a URL is incorrectly formatted by Wpull. (The incorrect formatting is not fixed yet however.)
0.34 (2014-05-06)¶
- Fixes file descriptor leak with
--phantomjsand--delete-after. - Fixes case where robots.txt file was stuck in download loop if server was offline.
- Fixes loading of cookies file from Wget. Cookie file header checks are disabled.
- Removes unneeded
--no-strong-robots(superseded with--no-strong-redirects.) - Fixes
--no-phantomjs-snapshotoption not respected. - More link extraction on HTML pages with elements with
onclick,onkeyX,onmouseX, anddata-attributes. - Adds web-based debugging console with
--debug-console-port.
0.33.2 (2014-04-29)¶
- Fixes links not resolved correctly when document includes
<base href="...">element. - Different proxy URL rewriting for PhantomJS option.
0.33.1 (2014-04-26)¶
- Fixes
--bind_addressoption not working. The option was never functional since the first release. - Fixes AttributeError crash when
--phantomjsand--X-scriptoptions were used. Thanks to yipdw for reporting. - Fixes
--warc-tempdirto use the current directory by default. - Fixes bad formatting and crash on links with malformed IPv6 addresses.
- Uses more rules for link extraction from JavaScript to reduce false positives.
0.33 (2014-04-21)¶
- Fixes invalid XHTML documents not properly extracted for links.
- Fixes crash on empty page.
- Support for extracting links from JavaScript segments and files.
- Doesn’t discard extracted links if document can only be parsed partially.
- API:
- Moves
OrderedDefaultDictfromutiltocollections. - Moves
DeflateDecompressor,gzip_decompressfromutiltodecompression. - Moves
sleep,TimedOut,wait_future,AdjustableSemaphorefromutiltoasync. - Moves
to_bytes,to_str,normalize_codec_name,detect_encoding,try_decoding,format_size,printable_bytes,coerce_str_to_asciifromutiltostring. - Removes
extendedmodule.
- Moves
- Scripting:
- Adds new wait_time() callback hook function.
0.32.1 (2014-04-20)¶
- Fixes XHTML documents not properly extracted for links.
- If a server responds with content declared as Gzip, the content is checked to see if it starts with the Gzip magic number. This check avoids misreading text as Gzip streams.
0.32 (2014-04-17)¶
- Fixes crash when HTML meta refresh URL is empty.
- Fixes crash when decoding a document that is malformed later in the document. These invalid documents are not searched for links.
- Reduces CPU usage when
--debuglogging is not enabled. - Better support for detecting and differentiating XHTML and XML documents.
- Fixes converting XHTML documents where it did not write XHTML syntax.
- RSS/Atom feed
link,url,iconelements are searched for links. - API:
document.detect_response_encoding()default peek argument is lowered to reduce hanging.document.BaseDocumentDetectoris now a base class for document type detection.
0.31 (2014-04-14)¶
- Fixes issue where an early
</html>causes link discovery to be broken and converted documents missing elements. - Fixes
--no-parentwhich did not behave like Wget. This issue was noticeable with options such as--span-hosts-allow linked-pages. - Fixes
--levelwhere page requisites were mistakenly not fetched if it exceeds recursion level. - Includes PhantomJS version string in WARC warcinfo record.
- User-agent string no longer includes Mozilla reference.
- Implements
--force-htmland--base. - Cookies now are limited to approximately 4 kilobytes and a maximum of 50 cookies per domain.
- Document parsing is now streamed for better handling of large documents.
- Scripting:
- Ability to set a scripting API version.
- Scripting API version 2: Adds
record_infoargument tohandle_errorandhandle_response.
- API:
- WARCRecorder uses new parameter object WARCRecorderParams.
document,scraper,convertermodules heavily modified to accommodate streaming readers.document.BaseDocumentReader.parsewas removed and replaced withread_links.- version.version_info available.
0.30 (2014-04-06)¶
- Fixes crash on SSL handshake if connection is broken.
- DNS entries are periodically removed from cache instead of held for long times.
- Experimental cx_freeze support.
- PhantomJS:
- Fixes proxy errors with requests containing a body.
- Fixes proxy errors with occasional FileNotFoundError.
- Adds timeouts to calls.
- Viewport size is now 1200 × 1920.
- Default
--phantomjs-scrollis now 10. - Scrolls to top of page before taking snapshot.
- API:
- URL filters moved into urlfilter module.
- Engine uses and exposes interface to AdjustableSemaphore for issue #93.
0.29 (2014-03-31)¶
- Fixes SSLVerficationError mistakenly raised during connection errors.
--span-hostsno longer implicitly enabled on non-recursive downloads. This behavior is superseded by strong redirect logic. (Use--span-hosts-allowto guarantee fetching of page-requisites.)- Fixes URL query strings normalized with unnecessary percent-encoding escapes. Some servers do not handle percent-encoded URLs well.
- Fixes crash handling directory paths that may contain a filename or a filename that is a directory. This crash occurs when URLs like /blog and /blog/ both exist. If a directory path contains a filename, that part of the directory path is suffixed with .d. If a filename is an existing directory, the filename is suffixed with .f.
- Fixes crash when URL’s hostname contains characters that decompose to dots.
- Fixes crash when HTML document declares encoding name unknown to Python.
- Fixes stuck in loop if server returns errors on robots.txt.
- Implements
--warc-dedup. - Implements
--ignore-length. - Implements
--output-document. - Implements
--http-compression. - Supports reading HTTP compression “deflate” encoding (both zlib and raw deflate).
- Scripting:
- Adds
engine_run()callback. - Exposes the instance factory.
- Adds
- API:
- connection:
Connectionarguments changed. UsesConnectionParamsas a parameter object.HostConnectionPoolarguments also changed. - database:
URLDBRecordrenamed toURL.URLStrDBRecordrenamed toURLString.
- connection:
- Schema change:
- New
visitstable.
- New
0.28 (2014-03-27)¶
- Fixes crash when redirected to malformed URL.
- Fixes
--directory-prefixnot being honored. - Fixes unnecessary high CPU usage when determining encoding of document.
- Fixes crash (GeneratorExit exception) when exiting on Python 3.4.
- Uses new internal socket connection stream system.
- Updates bundled certificates (Tue Jan 28 09:38:07 2014).
- PhantomJS:
- Fixes things not appearing in WARC files. This regression was introduced in 0.26 where PhantomJS’s disk cache was enabled. It is now disabled again.
- Fixes HTTPS proxy URL rewriting where relative URLs were not properly rewritten.
- Fixes proxy URL rewriting not working for localhost.
- Fixes unwanted
Accept-Languageheader picked up from environment. The value has been overridden to*. - Fixes
--headeroptions left out in requests.
- API:
- New
iostreammodule. extendedmodule is deprecated.
- New
0.27 (2014-03-23)¶
- Fixes URLs ignored (if any) on command line when
--input-fileis specified. - Fixes crash when redirected to a URL that is not HTTP.
- Fixes crash if lxml does not recognize the document encoding name. Falls back to Latin1 if lxml does not support the encoding after massaging the encoding name.
- Fixes crash on IPv6 addresses when using scripting or external API calls.
- Fixes speed shown as “0.0 B/s” instead of “– B/s” when speed can not be calculated.
- Implements
--local-encoding,--remote-encoding,--no-iri. - Implements
--https-only. - Prints bandwidth speed statistics when exiting.
- PhantomJS:
- Implements “smart scrolling” that avoids unnecessary scrolling.
- Adds
--no-phantomjs-smart-scroll
- API:
WebProcessorSession._parse_url()renamed toWebProcessorSession.parse_url()
0.26 (2014-03-16)¶
- Fixes crash when URLs like
http://example.com]were encountered. - Implements
--sitemaps. - Implements
--max-filename-length. - Implements
--span-hosts-allow(experimental, see issues #61, #66). - Query strings items like
?a&bare now preserved and no longer normalized to?a=&b=. - API:
- url.URLInfo.normalize() was removed since it was mainly used internally.
- Added url.normalize() convenience function.
- writer: safe_filename(), url_to_filename(), url_to_dir_path() were modified.
0.25 (2014-03-13)¶
- Fixes link converter not operating on the correct files when
.Nfiles were written. - Fixes apparent hang when Wpull is almost finished on documents with many links.
- Previously, Wpull added all URLs to the database, causing extra processing overhead in the database. Now, only requisite URLs are added to the database.
- Implements
--restrict-file-names. - Implements
--quota. - Implements
--warc-max-size. Like Wget, “max size” is not the maximum size of each WARC file but it is the threshold size to trigger a new file. Unlike Wget,requestandresponserecords are not split across WARC files. - Implements
--content-on-error. - Supports recording scrolling actions in WARC file when PhantomJS is enabled.
- Adds the
wpullcommand tobin/. - Database schema change:
filenamecolumn was added. - API:
- converter.py: Converters no longer use PathNamer.
- writer.py:
sanitize_file_parts()was removed in favor of newsafe_filename().save_document()returns a filename. - WebProcessor now requires a root path to be specified.
- WebProcessor initializer now takes “parameter objects”.
- Install requires new dependency:
namedlist.
0.24 (2014-03-09)¶
- Fixes crash when document encoding could not be detected. Thanks to DopefishJustin for reporting.
- Fixes non-index files incorrectly saved where an extra directory was added as part of their path.
- URL path escaping is relaxed. This helps with servers that don’t handle percent-encoding correctly.
- robots.txt now bypasses the filters. Use --no-strong-robots to disable this behavior.
- Redirects implicitly span hosts. Use --no-strong-redirects to disable this behavior.
- Scripting: should_fetch() info dict now contains reason as a key.
0.23.1 (2014-03-07)¶
- Important: Fixes issue where URLs were downloaded repeatedly.
0.23 (2014-03-07)¶
- Fixes incorrect logic in fetching robots.txt when it redirects to another URL.
- Fixes port number not included in the HTTP Host header.
- Fixes occasional RuntimeError when pressing CTRL+C.
- Fixes fetching URL paths containing dot segments. They are now resolved appropriately.
- Fixes ASCII progress bar occasionally not showing 100% when the download finished.
- Fixes crash and improves handling of unusual document encodings and settings.
- Improves handling of links with newlines and whitespace intermixed.
- Requires beautifulsoup4 as a dependency.
- API:
- util.detect_encoding() arguments modified to accept only a single fallback and to accept is_html.
- document.get_encoding() accepts is_html and peek arguments.
0.22.5 (2014-03-05)¶
- The ‘Refresh’ HTTP header is now scraped for URLs.
- When an error occurs during writing WARC files, the WARC file is truncated back to the last good state before crashing.
- Works around the error “Reached maximum read buffer size” when downloading on fast connections. A side effect is intensive CPU usage.
0.22.4 (2014-03-05)¶
- Fixes occasional error on chunked transfer encoding. Thanks to ivan for reporting.
- Fixes handling of links with newlines found in HTML pages. Newlines are now stripped from links when scraping pages to better handle HTML soup.
0.22.3 (2014-03-02)¶
- Fixes another case of AssertionError on url_item.is_processed when robots.txt was enabled.
- Fixes crash if a malformed gzip response was received.
- Fixes --span-hosts to be implicitly enabled (as with --no-robots) if --recursive is not supplied. This behavior unconditionally allows downloading a single file without specifying any options, which is what a user intuitively expects.
0.22.2 (2014-03-01)¶
- Improves performance on database operations. CPU usage should be less intensive.
0.22.1 (2014-02-28)¶
- Fixes handling of “204 No Content” responses.
- Fixes AssertionError on url_item.is_processed when robots.txt was enabled.
- Fixes PhantomJS page scrolling to be consistent.
- Lengthens PhantomJS viewport to ensure lazy-load images are properly triggered.
- Lengthens PhantomJS paper size to reduce excessive fragmentation of blocks.
0.22 (2014-02-27)¶
- Implements --phantomjs-scroll and --phantomjs-wait.
- Implements saving HTML and PDF snapshots (including inside WARC file). Disable with --no-phantomjs-snapshot.
- API: Adds PhantomJSController.
0.21.1 (2014-02-27)¶
- Fixes missing dependencies and files in setup.py.
- For PhantomJS:
- Fixes capturing HTTPS connections.
- Fixes statistics counter.
- Supports very basic scraping of HTML. See Usage section.
0.21 (2014-02-26)¶
- Fixes Request factory not being used. This resolves issues where the User Agent was not set.
- Experimental PhantomJS support. It can be enabled with --phantomjs. See the Usage section in the documentation for more details.
- API changes:
- The http module was split up into smaller modules: http.client, http.connection, http.request, http.util.
- ChunkedTransferStreamReader was added as a reusable abstraction.
- The web module was moved to http.web.
- Added proxy module.
- Added phantomjs module.
0.20 (2014-02-22)¶
- Implements --no-dns-cache, --accept, --reject.
- Scripting: Fixes AttributeError crash on handle_error.
- Another possible fix for issue #27.
0.19.2 (2014-02-18)¶
- Fixes crash if a non-HTTP URL was found during download.
- Lua scripting: Fixes booleans coming from Wpull being mistakenly converted to integers on Python 2.
0.19.1 (2014-02-14)¶
- Fixes --timestamping functionality.
- Fixes --timestamping not checking .orig files.
- Fixes HTTP handling of responses which do not return content.
0.19 (2014-02-12)¶
- Fixes files not actually being written.
- Implements --convert-links and --backup-converted.
- API:
- HTMLScraper functions were refactored to be class methods.
- ScrapedLink was renamed to LinkInfo.
0.18.1 (2014-02-11)¶
- Fixes error when WARC but not CDX option is specified.
- Fixes closing of the SQLite database to avoid leaving temporary database files.
0.18 (2014-02-11)¶
- Implements --no-warc-digests, --warc-cdx.
- Improvements on reducing CPU usage.
- API: Engine and Processor interaction refactored to be asynchronous.
- The Engine and Processor classes were modified significantly.
- The Engine no longer is concerned with fetching requests.
- Requests are handled within Processors. This will benefit future Processors to allow them to make arbitrary requests during processing.
- The RedirectTracker was moved to a new web module.
- A RichClient is implemented. It handles robots.txt, cookies, and redirect concerns.
- WARCRecord was moved into a new warc module.
0.17.3 (2014-02-07)¶
- Fixes ca-bundle file missing during install.
- Fixes AttributeError on retry_dns_error.
0.17.2 (2014-02-06)¶
- Another attempt to possibly fix #27.
- Implements cleaning inactive connections from the connection pool.
0.17.1 (2014-02-05)¶
- Another attempt to possibly fix #27.
- API: Refactored ConnectionPool. It now calls put on HostConnectionPool to avoid sharing a queue.
0.17 (2014-02-05)¶
- Implements cookie support.
- Fixes non-recursive downloads where robots.txt was checked unnecessarily.
- Possibly fix issue #27 where HTTP workers get stuck.
0.16.1 (2014-02-05)¶
- Adds some documentation about stopping Wpull and a list of all options.
- API: Builder now exposes Factory.
- API: WebProcessorSession was refactored to not pass arguments through the initializer. It also now uses DemuxDocumentScraper and DemuxURLFilter.
0.16 (2014-02-04)¶
- Implements all the SSL options: --certificate, --random-file, --egd-file, --secure-protocol.
- Further improvement on database performance.
0.15.2 (2014-02-03)¶
- Improves database performance by reducing CPU usage.
0.15.1 (2014-02-03)¶
- Improves database performance by reducing disk reads.
0.15 (2014-02-02)¶
- Fixes robots.txt being fetched for every request.
- Scripts: Supports replace as part of get_urls().
- Schema change: The database URL strings are normalized into a separate table. Using --database should now consume less disk space.
0.14.1 (2014-02-02)¶
- NameValueRecord now supports a normalize_override argument to specify how specific keys are cased, instead of the default title-case.
- Fixes WARC file field names to match the same cases as hanzo’s warc-tools. warc-tools does not support case-insensitivity as required by the WARC specification in section 4. The WARC files generated by Wpull are conformant, however.
0.14 (2014-02-01)¶
- Database change: SQLAlchemy is now used for the URL Table.
- Scripts: url_info['inline'] now returns a boolean, not an integer.
- Scripts:
- Implements --post-data and --post-file.
- Scripts can now return post_data and link_type as part of get_urls().
0.13 (2014-01-31)¶
- Supports reading HTTP responses with gzip content type.
0.12 (2014-01-31)¶
- No changes to program usage itself.
- More documentation.
- Major API changes due to refactoring:
- http.Body moved to conversation.Body.
- document.HTTPScraper, document.CSSScraper moved to scraper module.
- conversation module now contains base classes for protocol elements.
- processor.WebProcessorSession now uses keyword arguments.
- engine.Engine requires Statistics argument.
0.11 (2014-01-29)¶
- Implements --progress which includes a progress bar indicator.
- Bumps up the HTTP connection buffer size to support fast connections.
0.10.9 (2014-01-28)¶
- Adds documentation. No program changes.
0.10.8 (2014-01-26)¶
- Improves robustness against bad HTTP protocol messages.
- Fixes various URL and IRI handling issues.
- Fixes --input-file to work as expected.
- Fixes command line arguments not working under Python 2.
0.10 (2014-01-23)¶
- Improves handling of URLs and document encodings.
- Implements --ascii-print.
- Fixes Lua scripting conversion of Python to Lua object types.
0.9 (2014-01-21)¶
- Adds basic SSL options.
0.8 (2014-01-21)¶
- Supports Python and Lua scripting via --python-script and --lua-script.
0.7 (2014-01-18)¶
- Fixes robots.txt support.
0.6 (2014-01-17)¶
- Implements --warc-append, --concurrent.
- --read-timeout default is 900 seconds.
0.5 (2014-01-17)¶
- Implements --no-http-keepalive, --rotate-dns.
- Adds basic support for HTTPS.
0.4 (2014-01-15)¶
- Implements --continue, --no-clobber, --timestamping.
0.3.2 (2014-01-07)¶
- Fixes database rows not saved correctly.
0.3 (2014-01-07)¶
- Implements --hostnames and --exclude-hostnames.
0.2 (2014-01-06)¶
- Implements --header option.
- Various 3to2 bug fixes.
0.1 (2014-01-05)¶
- The first usable release.
WARC Specification¶
Additional de-facto and custom extensions to the WARC standard.
Wpull follows the specifications in the latest ISO 28500 draft.
FTP¶
FTP recording follows Heritrix specifications.
Control Conversation¶
The Control Conversation is recorded as
- WARC-Type: metadata
- Content-Type: text/x-ftp-control-conversation
- WARC-Target-URI: a URL. For example, ftp://anonymous@example.com/treasure.txt
- WARC-IP-Address: an IPv4 address with port or an IPv6 address with brackets and port
The resource is formatted as follows:
- Events are indented with an ASCII asterisk and space.
- Requests are indented with an ASCII greater-than and space.
- Responses are indented with an ASCII less-than and space.
The document encoding is UTF-8.
Changed in version 1.2a1: The document encoding previously used Latin-1.
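For illustration only, the following Python sketch assembles a record body under the prefix rules above. The FTP commands and replies shown are hypothetical, not actual Wpull output, and the line separator is an assumption.

# Hypothetical control conversation assembled with the prefixes described above.
lines = [
    "* Connected to example.com.",                  # event: asterisk and space
    "> USER anonymous",                             # request: greater-than and space
    "< 230 Login successful.",                      # response: less-than and space
    "> RETR treasure.txt",
    "< 150 Opening BINARY mode data connection.",
]
payload = "\n".join(lines).encode("utf-8")          # the document encoding is UTF-8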
Response data¶
The response data is recorded as
- WARC-Type: resource
- WARC-Target-URI: a URL. For example, ftp://anonymous@example.com/treasure.txt
- WARC-Concurrent-To: a WARC Record ID of the Control Conversation
PhantomJS¶
Snapshot¶
A PhantomJS Snapshot represents the state of the DOM at the time of capture.
A Snapshot is recorded as
- WARC-Type: resource
- WARC-Target-URI: urn:X-wpull:snapshot?url=URLHERE where URLHERE is a percent-encoded URL of the PhantomJS page.
- Content-Type: one of application/pdf, text/html, image/png
- WARC-Concurrent-To: a WARC Record ID of a Snapshot Action Metadata.
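As a rough sketch (not Wpull’s actual code), the Snapshot target URI can be formed by percent-encoding the page URL. The page URL and the exact set of characters escaped are assumptions for illustration.

from urllib.parse import quote

page_url = "http://example.com/page?id=1"           # hypothetical PhantomJS page URL
# Percent-encode the page URL and embed it in the urn:X-wpull:snapshot URI.
target_uri = "urn:X-wpull:snapshot?url=" + quote(page_url, safe="")
# e.g. urn:X-wpull:snapshot?url=http%3A%2F%2Fexample.com%2Fpage%3Fid%3D1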
Snapshot Action Metadata¶
An Action Metadata is a log of steps performed before a Snapshot is taken.
It is recorded as
- WARC-Type: metadata
- Content-Type: application/json
- WARC-Target-URI: urn:X-wpull:snapshot?url=URLHERE where URLHERE is a percent-encoded URL of the PhantomJS page.
Wpull Metadata¶
Log¶
Wpull’s log is recorded as
- WARC-Type: resource
- Content-Type: text/plain
- WARC-Target-URI: urn:X-wpull:log
The document encoding is UTF-8.
youtube-dl¶
The JSON file is recorded as
- WARC-Type: metadata
- Content-Type: application/vnd.youtube-dl_formats+json
- WARC-Target-URI: metadata://AUTHORITY_AND_RESOURCE where AUTHORITY_AND_RESOURCE is the hierarchical part, query, and fragment of the URL passed to youtube-dl. In other words, the URI is the URL where the scheme is replaced with metadata.
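As an informal sketch (not Wpull’s actual implementation), the target URI can be derived by swapping the scheme of the original URL. The video URL below is hypothetical.

from urllib.parse import urlsplit, urlunsplit

video_url = "https://www.example.com/watch?v=abc123"   # hypothetical URL passed to youtube-dl
parts = urlsplit(video_url)
# Keep the authority, path, query, and fragment; replace the scheme with "metadata".
target_uri = urlunsplit(("metadata",) + tuple(parts[1:]))
# e.g. metadata://www.example.com/watch?v=abc123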