cupage

cupage checks web pages and displays changes from the last run that match a given criteria. Its original purpose was to check web pages for new software releases, but it is easily configurable and can be used for other purposes.

It is written in Python, and requires v2.6 or later. cupage is released under the GPL v3

Contents:

Background

I had been looking for a better way to help me keep on top of software releases for the projects I’m interested in, be that either personally or for things we use at work.

Some projects have Atom feeds, some have mailing lists just for release updates, some post updates on sites like freshmeat and some have no useful update watching mechanism at all. Tracking all these resources is annoying and a simple unified solution would be much more workable.

cupage is that solution, at least for my purposes. Maybe it could be for you too!

Database

With a local, unified tool we would instantly gain easy access to the updates database for use from other tools and applications.

JSON was chosen as it is simple to read and write, especially so from Python using the json module [1].

The database is a simple serialisation of the cupage.Sites object. The cupage.Sites object is a container for cupage.Site objects. Only persistent data from cupage.Site objects that can not be regenerated from the configuration file is stored in the database, namely last check time stamp and the current matches.

matches is an array, and contains the string matches of previous cupage runs.

checked is the offset in seconds from the Unix epoch that the site was last checked. It is normally a float, but may be null.

An example database file could be:

{
    "geany-plugins": {
        "matches": [
            "geany-plugins-0.17.1.tar.bz2",
            "geany-plugins-0.17.1.tar.gz",
            "geany-plugins-0.17.tar.bz2",
            "geany-plugins-0.17.tar.gz",
            "geany-plugins-0.18.tar.bz2",
            "geany-plugins-0.18.tar.gz"
        ],
        "checked": 1256677592.0
    },
    "interlude": {
        "matches": [
            "interlude-1.0.tar.gz"
        ],
        "checked": null
    }
}
[1]Pickle was used in versions prior to 0.3.0. The switch was made as Pickle provided no benefits over JSON, and some significant drawbacks including the lack of support for reading it from other languages.

Configuration

cupage stores its configuration in ~/.cupage.conf by default, although you can specify a different location with the cupage list -f command line option.

The configuration file is a INI format file, with a section for each site definition. The section header is the site’s name which will be displayed in the update output, or used to select individual sites to check on the command line. Each section consists of a section of name=value option pairs.

An example configuration file is below:

[pep8]
site = pypi
match_type = tar
[pydelicious]
site = google code
match_type = zip
[pyisbn]
url = http://www.jnrowe.ukfsn.org/_downloads/
select = pre > a
match_type = tar
frequency = 6m
[upoints]
url = http://www.jnrowe.ukfsn.org/_downloads/
select = pre > a
match_type = tar
[fruity]
site = vim-script
script = 1871
[cupage]
site = github
user = JNRowe
frequency = 1m

Site definitions can either be specified entirely manually, or possibly with the built-in site matchers(see site option for available options).

frequency option

The frequency option allows you to set a minimum time between checks for specific sites within the configuration file.

The format is <value> <units> where value can be a integer or float, and units must be one of the entries from the table below:

Unit Purpose
h Hours
d Days
w Week
m Month, which is defined as 28 days
y Year, which is defined as 13 m units

match option

If match_type is re then match must be a valid regular expression that will be used to match within the selected elements. For most common uses a prebuilt match_type already exists(see match_type option), and re should really only be used as a last resort.

The Python re module is used, and any functionality allowed by the module is available in the match option(with the notable exception of the verbose syntax).

match_type option

The match_type value, if used, must be one of the following:

Match type Purpose
gem to match rubygems_ archives.
re to define custom regular expressions
tar to match gzip/bzip2/xz compressed tar archives(default)
zip to match zip archives

The match_type values simply select a predefined regular expression to use. The base match is <name>-[\d\.]+([_-](pre|rc)[\d]+)?\.<type>, where <name> is the section name and <type> is the value of match_type for this section.

select option

The select option, if used, must be a valid CSS or XPath selector depending on the value of selector (see selector option) . Unless specified CSS Cascading Style Sheets) is the default selector type.

selector option

The selector option, if used, must be one of the following:

Selector Purpose
css To select elements within the page using CSS selectors (default)
xpath To select elements within the page using XPath selectors

site option

The site option, if used, must be one of the following, hopefully self-explanatory values:

Site Added Required options
cpan v0.4.0  
debian v0.3.0  
failpad v0.5.0  
github v0.3.1 user (GitHub user name)
google code v0.1.0  
hackage v0.1.0  
pypi v0.1.0  
vim-script v0.3.0 script (script id on the vim website)

site options are simply shortcuts that are provided to reduce duplication in the configuration file. They define the values necessary to check for updates on the given site.

url option

The url value is the location of the page to be checked for updates. If used, it must be a valid FTP/HTTP/HTTPS address.

Usage

The cupage is run from the command prompt, and displays updates on stdout.

Options

--version

show program’s version number and exit

--help

show this help message and exit

-v, --verbose

produce verbose output

-q, --quiet

output only matches and errors

Commands

add - add definition to config file

-f <file>, --config <file>

configuration file to read

-s <site>, --site <site>

site helper to use

-u <url>, --url <url>

site url to check

-t <type>, --match-type <type>

pre-defined regular expression to use

-m <regex>, --match <regex>

regular expression to use with –match-type=re

-q <frequency>, --frequency <frequency>

update check frequency

-x <selector>, --select <selector>

content selector

--selector <type>

selector method to use

check - check sites for updates

-f <file>, --config <file>

configuration file to read

-d <file>, --database <file>

database to store page data to. Default based on --config value, for example --config my_conf will result in a default setting of --database my_conf.db.

See Database for details of the database format.

-c <dir>, --cache <dir>

directory to store page cache

This can, and in fact should be, shared between all cupage uses.

--no-write

don’t update cache or database

--force

ignore frequency checks

-t <n>, --timeout=<n>

timeout for network operations

list - list definitions from config file

-f <file>, --config <file>

configuration file to read

-m <regex>, --match <regex>

match sites using regular expression

list-sites - list supported site values

remove - remove site from config

-f <file>, --config <file>

configuration file to read

cupage.py

Check for updates on web pages

Author:James Rowe <jnrowe@gmail.com>
Date:2010-01-23
Copyright:GPL v3
Manual section:1
Manual group:Networking

SYNOPSIS

cupage.py [option]... <command>

DESCRIPTION

cupage checks web pages and displays changes from the last run that match a given criteria. Its original purpose was to check web pages for new software releases, but it is easily configurable and can be used for other purposes.

OPTIONS

--version show program’s version number and exit
--help show this help message and exit
-v, --verbose produce verbose output
-q, --quiet output only matches and errors

COMMANDS

add

Add definition to config file

-f <file>, --config <file>
 configuration file to read
-s <site>, --site <site>
 site helper to use
-u <url>, --url <url>
 site url to check
-t <type>, --match-type <type>
 pre-defined regular expression to use
-m <regex>, --match <regex>
 regular expression to use with –match-type=re
-q <frequency>, --frequency <frequency>
 update check frequency
-x <selector>, --select <selector>
 content selector
--selector <type>
 selector method to use

check

Check sites for updates

-f <file>, --config <file>
 configuration file to read
-d <file>, --database <file>
 

database to store page data to. Default based on cupage check -f value, for example --config my_conf will result in a default setting of --database my_conf.db.

See Database for details of the database format.

-c <dir>, --cache <dir>
 

directory to store page cache

This can, and in fact should be, shared between all cupage uses.

--no-write don’t update cache or database
--force ignore frequency checks
-t <n>, --timeout=<n>
 timeout for network operations

list

List definitions from config file

-f <file>, --config <file>
 configuration file to read
-m <regex>, --match <regex>
 match sites using regular expression

list-sites

List supported site values

remove

Remove site from config

-f <file>, --config <file>
 configuration file to read

CONFIGURATION FILE

The configuration file, by default ~/.cupage.conf, is a simple INI format file, with sections defining sites to check. For example:

[spill]
url = http://www.rpcurnow.force9.co.uk/spill/index.html
select = p a
[rails]
site = vim-script
script = 1567

With the above configuration file the site named spill will be checked at http://www.rpcurnow.force9.co.uk/spill/index.html, and elements matching the CSS selector p a will be scanned for tarballs. The site named rails will be checked using the vim-script site matcher, which requires only a script value to check for updates in the scripts section of http://www.vim.org.

Various site matchers are available, see the output of cupage.py --list-sites.

BUGS

None known.

AUTHOR

Written by James Rowe

RESOURCES

Home page: http://github.com/JNRowe/cupage

COPYING

Copyright © 2009-2014 James Rowe.

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Frequently Asked Questions

Ask them, and perhaps they’ll become frequent enough to be added here ;)

API documentation

Note

The documentation in this section is aimed at people wishing to contribute to cupage, and can be skipped if you are simply using the tool from the command line.

Site

Note

The documentation in this section is aimed at people wishing to contribute to cupage, and can be skipped if you are simply using the tool from the command line.

cupage.SITES = {}
Site specific configuration data
cupage.USER_AGENT = 'cupage/0.8.2 (https://github.com/JNRowe/cupage/)'

User agent to use for HTTP requests

class cupage.Site(name, url, match_func='default', options=None, frequency=None, robots=True, checked=None, matches=None)[source]

Initialise a new Site object.

Parameters:
  • name (str) – Site name
  • url (str) – URL for site
  • match_func (str) – Function to use for retrieving matches
  • options (dict) – Options for match_func
  • frequency (int) – Site check frequency
  • robots (bool) – Whether to respect a host’s robots.txt
  • checked (datetime.datetime) – Last checked date
  • matches (list) – Previous matches
check(cache=None, timeout=None, force=False, no_write=False)[source]

Check site for updates.

Parameters:
  • cache (str) – httplib2.Http cache location
  • timeout (int) – Timeout value for httplib2.Http
  • force (bool) – Ignore configured check frequency
  • no_write (bool) – Do not write to cache, useful for testing
find_default_matches(content, charset)[source]

Extract matches from content.

Parameters:
  • content (str) – Content to search
  • charset (str) – Character set for content
find_github_matches(content, charset)[source]

Extract matches from GitHub content.

Parameters:
  • content (str) – Content to search
  • charset (str) – Character set for content
find_google_code_matches(content, charset)[source]

Extract matches from Google Code content.

Parameters:
  • content (str) – Content to search
  • charset (str) – Character set for content
find_hackage_matches(content, charset)[source]

Extract matches from hackage content.

Parameters:
  • content (str) – Content to search
  • charset (str) – Character set for content
find_sourceforge_matches(content, charset)[source]

Extract matches from sourceforge content.

Parameters:
  • content (str) – Content to search
  • charset (str) – Character set for content
static package_re(name, ext, verbose=False)[source]

Generate a compiled re for the package.

Parameters:
  • name (str) – File name to check for
  • ext (str) – File extension to check
  • verbose (bool) – Whether to enable re.VERBOSE
static parse(name, options, data)[source]

Parse data generated by Sites.loader.

Parameters:
  • name (str) – Site name from config file
  • options (configobj.ConfigObj) – Site options from config file
  • data (dict) – Stored data from database file
state

Return Site state for database storage.

class cupage.Sites[source]

Site bundle wrapper.

load(config_file, database=None)[source]

Read sites from a user’s config file and database.

Parameters:
  • config_file (str) – Config file to read
  • database (str) – Database file to read
save(database)[source]

Save Sites to the user’s database.

Parameters:database (str) – Database file to write

Examples

Reading stored configuration
>>> sites = Sites()
>>> sites.load('support/cupage.conf', 'support/cupage.db')
>>> sites[0].frequency
360000
Writing updates
>>> sites.save('support/cupage.db')

Command line

Note

The documentation in this section is aimed at people wishing to contribute to cupage, and can be skipped if you are simply using the tool from the command line.

cupage.cmdline.USAGE = '%(prog)s checks web pages and displays changes from the last run that match\na given criteria. Its original purpose was to check web pages for new software\nreleases, but it is easily configurable and can be used for other purposes.'

Command line help string, for use with argparse

cupage.cmdline.main()[source]

Main script handler.

cupage.cmdline.add()

Add new site to config.

Parameters:
  • config (str) – Location of config file
  • site (str) – Site helper to match with
  • match_type (str) – Filename match pattern
  • match (str) – Regular expression to use when match_type is re
  • frequency (str) – Update frequency
  • select (str) – Page content to check
  • site – Type of selector to use
  • name (str) – Name for new entry
cupage.cmdline.check()

Check sites for updates.

Parameters:
  • globs (dict) – Global options object
  • config (str) – Location of config file
  • database (str) – Location of database file
  • cache (str) – Location of cache directory
  • write (bool) – Whether to update cache/database
  • force (bool) – Force update regardless of frequency setting
  • frequency (datetime.timedelta) – Update frequency
  • timeout (int) – Network timeout in seconds
  • pages (list of str) – Pages to check
cupage.cmdline.list_conf()

List site definitions in config file.

Parameters:
  • config (str) – Location of config file
  • database (str) – Location of database file
  • match (str) – Display sites matching the given regular expression
  • pages (list of str) – Pages to check
cupage.cmdline.list_sites()

List built-in site matcher definitions.

Parameters:globs (dict) – Global options object
cupage.cmdline.remove()

Remove sites for config file.

Parameters:
  • globs (dict) – Global options object
  • config (str) – Location of config file
  • pages (list of str) – Pages to check

Examples

Parse command line options
>>> options, args = process_command_line()

Utilities

Note

The documentation in this section is aimed at people wishing to contribute to cupage, and can be skipped if you are simply using the tool from the command line.

class cupage.utils.CupageEncoder(skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, encoding='utf-8', default=None)[source]

Constructor for JSONEncoder, with sensible defaults.

If skipkeys is false, then it is a TypeError to attempt encoding of keys that are not str, int, long, float or None. If skipkeys is True, such items are simply skipped.

If ensure_ascii is true (the default), all non-ASCII characters in the output are escaped with uXXXX sequences, and the results are str instances consisting of ASCII characters only. If ensure_ascii is False, a result may be a unicode instance. This usually happens if the input contains unicode strings or the encoding parameter is used.

If check_circular is true, then lists, dicts, and custom encoded objects will be checked for circular references during encoding to prevent an infinite recursion (which would cause an OverflowError). Otherwise, no such check takes place.

If allow_nan is true, then NaN, Infinity, and -Infinity will be encoded as such. This behavior is not JSON specification compliant, but is consistent with most JavaScript based encoders and decoders. Otherwise, it will be a ValueError to encode such floats.

If sort_keys is true, then the output of dictionaries will be sorted by key; this is useful for regression tests to ensure that JSON serializations can be compared on a day-to-day basis.

If indent is a non-negative integer, then JSON array elements and object members will be pretty-printed with that indent level. An indent level of 0 will only insert newlines. None is the most compact representation. Since the default item separator is ‘, ‘, the output might include trailing whitespace when indent is specified. You can use separators=(‘,’, ‘: ‘) to avoid this.

If specified, separators should be a (item_separator, key_separator) tuple. The default is (‘, ‘, ‘: ‘). To get the most compact JSON representation you should specify (‘,’, ‘:’) to eliminate whitespace.

If specified, default is a function that gets called for objects that can’t otherwise be serialized. It should return a JSON encodable version of the object or raise a TypeError.

If encoding is not None, then all input strings will be transformed into unicode using that encoding prior to JSON-encoding. The default is UTF-8.

default(obj)[source]

Handle datetime objects when encoding as JSON.

This simply falls through to default() if obj has no isoformat method.

Parameters:obj – Object to encode
cupage.utils.json_to_datetime(obj)[source]

Parse checked datetimes from cupage databases.

See:json.JSONDecoder
Parameters:obj – Object to decode
cupage.utils.parse_timedelta(delta)[source]

Parse human readable frequency.

Parameters:delta (str) – Frequency to parse
cupage.utils.sort_packages(packages)[source]

Order package list according to version number.

Parameters:packages (list) – Packages to sort
cupage.utils.robots_test(http, url, name, user_agent='*')[source]

Check whether a given URL is blocked by robots.txt.

Parameters:
  • httphttplib2.Http object to use for requests
  • url (str) – URL to check
  • name – Site name being checked
  • user_agent (str) – User agent to check in robots.txt

The following three functions are defined for purely cosmetic reasons, as they make the calling points easier to read.

cupage.utils.success(text)[source]

Format a success message with colour, if possible.

Parameters:text (str) – Text to format
cupage.utils.fail(text)[source]

Format a failure message with colour, if possible.

Parameters:text (str) – Text to format
cupage.utils.warn(text)[source]

Format a warning message with colour, if possible.

Parameters:text (str) – Text to format

Examples

Output formatting
>>> success('well done!')
u'\x1b[38;5;10mwell done!\x1b[m\x1b(B'
>>> fail('unlucky!')
u'\x1b[38;5;9munlucky!\x1b[m\x1b(B'

Alternatives

Before diving in and spitting out this package I looked at the alternatives below. If I have missed something please drop me a mail.

It isn’t meant to be unbiased, and you should try the packages out for yourself. I keep it here mostly as a reference for myself, and maybe to help out people who are already familiar with one of the entries below so they can see where I’m coming from.

ck4up

ck4up is a small tool written in Ruby which scans webpages for updates by storing the hash of checked pages. It provides pretty much the same functionality as cupage, but in a slightly different manner.

The major differences are a lack of HTTP cache support, a more manual configuration method and no built in support for various hosting sites.

urlwatch

urlwatch is a great little tool, which sends you emails containing the differences in web pages. To some extent cupage is mostly a narrow subset of the functionality provided by urlwatch, and the functionality could have been implemented on top with a bunch of hooks.

In my opinion the disadvantages are a lack of HTTP cache support, the configuration requires users to write Python and no built in support for various hosting sites. The massive advantage is how configurable and hackable the tool can be thanks to the “config is a python script” design.

Release HOWTO

Test

In the general case tests can be run via nose2:

$ nose2 -vv tests

When preparing a release it is important to check that cupage works with all currently supported Python versions, and that the documentation is correct.

Prepare release

With the tests passing, perform the following steps

  • Update the version data in cupage/_version.py
  • Update NEWS.rst, if there are any user visible changes
  • Commit the release notes and version changes
  • Create a signed tag for the release
  • Push the changes, including the new tag, to the GitHub repository

Update PyPI

Create and upload the new release tarballs to PyPI:

$ ./setup.py sdist --formats=bztar,gztar register upload --sign

Fetch the uploaded tarballs, and check for errors.

You should also perform test installations from PyPI, to check the experience cupage users will have.

Appendix

class httplib2.Http

Instance of Http from httplib2

Indices and tables