storytracker

Tools for tracking stories on news homepages

Contents:

How to use it

Getting started

You can install storytracker from the Python Package Index using the command-line tool pip. If you don’t have pip installed, follow its installation instructions first. Here is all it takes:

$ pip install storytracker

You won’t need it for archival, but the analytical tools explained later on require that you have a supported web browser installed. Firefox will work, but it’s recommended that you install PhantomJS, a “headless” browser that runs behind the scenes.

On Ubuntu Linux that’s as easy as:

$ sudo apt-get install phantomjs

On Apple’s OS X you can use Homebrew to install it like so:

$ brew update && brew install phantomjs

Archiving URLs

From the command line

Once installed, you can start using storytracker’s command-line tools immediately, like storytracker-archive.

$ storytracker-archive http://www.latimes.com

That should pour out a scary-looking stream of data to your console. That is the content of the page you requested, compressed using gzip. If you’d prefer to see the raw HTML, add the --do-not-compress option.

$ storytracker-archive http://www.latimes.com --do-not-compress

You could save that yourself using a standard UNIX pipeline.

$ storytracker-archive http://www.latimes.com --do-not-compress > archive.html

But why do that when storytracker.create_archive_filename() will work behind the scenes to automatically come up with a tidy name that includes both the URL and a timestamp?

$ storytracker-archive http://www.latimes.com --do-not-compress --output-dir="./"

Run that and you’ll see the file right away in your current directory.

# Try opening the file you spot here with your browser
$ ls | grep .html

With Python

UNIX-like systems typically come equipped with a built-in method for scheduling tasks known as cron. To use it with storytracker, one approach is to write a Python script that retrieves a series of sites each time it is run.

import storytracker

SITE_LIST = [
    # A list of the sites to archive
    'http://www.latimes.com',
    'http://www.nytimes.com',
    'http://www.kansascity.com',
    'http://www.knoxnews.com',
    'http://www.indiatimes.com',
]
# The place on the filesystem where you want to save the files
OUTPUT_DIR = "/path/to/my/directory/"

# Runs when the script is called with the python interpreter
# e.g. "$ python cron.py"
if __name__ == "__main__":
    # Loop through the site list
    for s in SITE_LIST:
        # Spit out what you're doing
        print "Archiving %s" % s
        try:
            # Attempt to archive each site at the output directory
            # defined above
            storytracker.archive(s, output_dir=OUTPUT_DIR)
        except Exception as e:
            # And just move along and keep rolling if it fails.
            print(e)

Scheduling with cron

Then edit the cron file from the command line.

$ crontab -e

Then use cron’s scheduling expressions to run the job however you’d like. This example would run a script like the one above at the top of every hour. It assumes that storytracker is available to the global Python installation at /usr/bin/python. If you are using a virtualenv or a different Python configuration, begin the line with the path to that particular python executable, as in the second example below.

0 * * * *  /usr/bin/python /path/to/my/script/cron.py
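
If you use a virtualenv, the entry might instead look like this. The interpreter path is a placeholder; substitute the location of your own environment’s python.

0 * * * *  /home/ben/.virtualenvs/storytracker/bin/python /path/to/my/script/cron.py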

Analyzing archived URLs

Visualizing archived URLs

Highlighted overlay

You can output a static image that pops out headlines, stories and images on the page using the ArchivedURL.write_overlay_to_directory method available on all ArchivedURL() objects.

obj = storytracker.archive("http://www.cnn.com")
obj.write_overlay_to_directory("/home/ben/Desktop")

The resulting image is sized at the same width and height as the real page. Images have a red stroke around them. Hyperlinks the system estimates lead to stories have a purple border. All other links are blue.

[Image: example of the highlighted overlay output (overlay.png)]

Abstract illustration

You can output an abstract image visualizing where headlines, stories and images are on the page using the ArchivedURL.write_illustration_to_directory method available on all ArchivedURL() objects. The following code will write a new image of the CNN homepage to my desktop.

obj = storytracker.archive("http://www.cnn.com")
obj.write_illustration_to_directory("/home/ben/Desktop")

The resulting image is sized at the same width and height as the real page, with images colored red. Hyperlinks are colored in too: if the system thinks a link leads to a news story, it is filled in purple; otherwise it is colored blue.

[Image: example of the abstract illustration output (illo.jpg)]

Ingesting archived URLs from the Wayback Machine

A page saved by the Internet Archive’s excellent Wayback Machine can be integrated by passing its URL to storytracker.open_wayback_machine_url().

This pulls down the CNN homepage captured on Sept. 11, 2001.

>>> import storytracker
>>> obj = storytracker.open_wayback_machine_url('https://web.archive.org/web/20010911213814/http://www.cnn.com/')

Now you have an ArchivedURL object like any other in the storytracker system.

>>> obj
<ArchivedURL: http://www.cnn.com/@2001-09-11 21:38:14>

So if, for instance, you wanted to see all the images on the page, you could do this.

>>> for i in obj.images:
...     print(i.src)
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
https://web.archive.org/web/20010911213814id_/http://www.cnn.com/images/newmain/top.main.special.report.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
https://web.archive.org/web/20010911213814id_/http://www.cnn.com/images/newmain/header.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com/images/0109/top.exclusive.jpg
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com//images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com/images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com/images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com/images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com/images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com/images/hub2000/1.gif
http://a388.g.akamai.net/f/388/21/1d/www.cnn.com/images/hub2000/1.gif
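
You could make a similar pass over the page’s hyperlinks. This is a minimal sketch that assumes the hyperlinks list covered in the analysis section, and that each Hyperlink carries href and is_story attributes (is_story is noted in the changelog).

>>> # Keep only the links storytracker estimates lead to news stories
>>> story_links = [a for a in obj.hyperlinks if a.is_story]
>>> for a in story_links:
...     print(a.href)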

Python interfaces

Archiving

Tools to download and save URLs.

archive

Archives the HTML from the provided URL

storytracker.archive(url, verify=True, minify=True, extend_urls=True, compress=True, output_dir=None)
Parameters:
  • url (str) – The URL of the page to archive
  • verify (bool) – Verify that HTML is in the response’s content-type header
  • minify (bool) – Minify the HTML response to reduce its size
  • extend_urls (bool) – Extend relative URLs discovered in the HTML response to be absolute
  • compress (bool) – Compress the HTML response using gzip if an output_dir is provided
  • output_dir (str or None) – Provide a directory for the archived data to be stored
Returns: An ArchivedURL object
Return type: ArchivedURL
Raises ValueError: If the response is not verified as HTML

Example usage:

>>> import storytracker

>>> # This returns an ArchivedURL object holding the page's content
>>> obj = storytracker.archive("http://www.latimes.com")
>>> obj
<ArchivedURL: http://www.latimes.com@2014-07-17 04:08:32.169810+00:00>

>>> # You can save it to an automatically named file in a directory you provide
>>> obj = storytracker.archive("http://www.latimes.com", output_dir="./")
>>> obj.archive_path
'./http!www.latimes.com!!!!@2014-07-17T04:09:21.835271+00:00.gz'
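
Because the saved file follows the storytracker.create_archive_filename() convention, you can reopen it later for analysis with storytracker.open_archive_filepath().

>>> # Re-open the saved archive as an ArchivedURL object
>>> saved = storytracker.open_archive_filepath(obj.archive_path)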

get

Retrieves the HTML from the provided URL

storytracker.get(url, verify=True)
Parameters:
  • url (str) – The URL of the page to archive
  • verify (bool) – Verify that HTML is in the response’s content-type header
Returns: The content of the HTML response
Return type: str
Raises ValueError: If the response is not verified as HTML

Example usage:

>>> import storytracker

>>> html = storytracker.get("http://www.latimes.com")

Analysis

ArchivedURL

A URL’s archived HTML with tools for analysis.

class ArchivedURL(url, timestamp, html, gzip_archive_path=None, html_archive_path=None, browser_width=1024, browser_height=768, browser_driver="PhantomJS")

Initialization arguments

url

The url archived

timestamp

The date and time when the url was archived

html

The HTML archived

Optional initialization options

gzip_archive_path

A file path leading to an archive of the URL stored in a gzipped file.

html_archive_path

A file path leading to an archive of the URL stored in a raw HTML file.

browser_width

The width of the browser that will be opened to inspect the URL’s HTML. By default it is 1024.

browser_height

The height of the browser that will be opened to inspect the URL’s HTML. By default it is 768.

browser_driver

The name of the browser that Selenium will use to open up HTML files. By default it is PhantomJS.

Other attributes

height

The height of the page in pixels after the URL is opened in a web browser

width

The width of the page in pixels after the URL is opened in a web browser

gzip

Returns the archived HTML as a stream of gzipped data

archive_filename

Returns a file name for this archive using the conventions of storytracker.create_archive_filename().

hyperlinks

A list of all the hyperlinks extracted from the HTML

images

A list of all the images extracted from the HTML

largest_headline

Returns the story hyperlink with the largest area on the page. If there is a tie, returns the one that appears first on the page.

largest_image

The largest image extracted from the HTML

A list of all the hyperlinks extracted from the HTML that are estimated to lead to news stories.

summary_statistics

Returns a dictionary with basic summary statistics about hyperlinks and images on the page

Analysis methods

analyze()

Opens the URL’s HTML in a web browser and runs all of the analysis methods that use it.

get_cell(x, y, cell_size=256)

Returns the grid cell where the provided x and y coordinates appear on the page. Cells are sized as squares, with 256 pixels as the default.

The value is returned in the style of algebraic notation used in a game of chess.

>>> obj.get_cell(1, 1)
'a1'
>>> obj.get_cell(257, 1)
'b1'
>>> obj.get_cell(1, 513)
'a3'
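
If you are curious how coordinates translate into those chess-style cells, here is a rough standalone sketch of the idea under the defaults above. It is illustrative only, not the library’s code.

from string import ascii_lowercase

def cell_for(x, y, cell_size=256):
    # Columns run a, b, c, ... from left to right; rows count down from 1
    column = ascii_lowercase[(x - 1) // cell_size]
    row = ((y - 1) // cell_size) + 1
    return "%s%d" % (column, row)

cell_for(1, 1)    # 'a1'
cell_for(257, 1)  # 'b1'
cell_for(1, 513)  # 'a3'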

Returns the Hyperlink object that matches the submitted href, if it exists.

open_browser()

Opens the URL’s HTML in a web browser so it can be analyzed.

close_browser()

Closes the web browser opened to analyze the URL’s HTML

Output methods

write_hyperlinks_csv_to_file(file)

Returns the provided file object with a ready-to-serve CSV list of all hyperlinks extracted from the HTML.

write_gzip_to_directory(path)

Writes gzipped HTML data to a file in the provided directory path

write_html_to_directory(path)

Writes HTML data to a file in the provided directory path

write_illustration_to_directory(path)

Writes out a visualization of the hyperlinks and images on the page as a JPG to the provided directory path.

Example usage:

>>> import storytracker

>>> obj = storytracker.open_archive_filepath('/home/ben/archive/http!www.latimes.com!!!!@2014-07-06T16:31:57.697250.gz')
>>> obj.url
'http://www.latimes.com'

>>> obj.timestamp
datetime.datetime(2014, 7, 6, 16, 31, 57, 697250)

ArchivedURLSet

A list of ArchivedURL objects.

class ArchivedURLSet(list)

List items added to the set must be unique ArchivedURL objects.

Parses all of the hyperlinks from the HTML of all the archived URLs and returns a list of the distinct href hyperlinks with a series of statistics attached that describe how they are positioned.

summary_statistics

Returns a dictionary of summary statistics about the whole set of archived URLs.

print_href_analysis(href)

Outputs a human-readable analysis of the submitted href’s position across the set of archived URLs.

write_href_gif_to_directory(href, path, duration=0.5)

Writes out an animation of a hyperlink’s position on the page as a GIF to the provided directory path

Returns the provided file object with a ready-to-serve CSV list of all hyperlinks extracted from the HTML.

Example usage:

>>> import storytracker

>>> obj_list = storytracker.open_archive_directory('/home/ben/archive/')

>>> obj_list[0].url
'http://www.latimes.com'

>>> obj_list[1].timestamp
datetime.datetime(2014, 7, 6, 16, 31, 57, 697250)

Image

class Image(src)

An image extracted from an archived URL.

Initialization arguments

src

The src attribute of the image tag

x

The x coordinate of the object’s location on the page.

y

The y coordinate of the object’s location on the page.

width

The width of the object’s size on the page.

height

The height of the object’s size on the page.

cell

The grid cell where the provided x and y coordinates appear on the page. Cells are sized as squares, with 256 pixels as the default.

The value is returned in the style of algebraic notation used in a game of chess.

Analysis methods

area

Returns the area of the image in square pixels

orientation

Returns a string describing the shape of the image.

  • ‘square’ means the width and height are equal
  • ‘landscape’ is a horizontal image with width greater than height
  • ‘portrait’ is a vertical image with height greater than width
  • None means there are no size attributes to test
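
The logic amounts to a simple comparison. Here is a minimal sketch of the rules above, for illustration rather than the library’s implementation.

def orientation(width, height):
    # Mirrors the rules described above; missing dimensions mean None
    if not width or not height:
        return None
    if width == height:
        return 'square'
    return 'landscape' if width > height else 'portrait'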

File handling

Functions for naming, saving and retrieving archived URLs.

create_archive_filename

Returns a string that combines a URL and a timestamp for naming archives saved to the filesystem.

storytracker.create_archive_filename(url, timestamp)
Parameters:
  • url (str) – The URL of the page that is being archived
  • timestamp (datetime) – A timestamp recording approximately when the URL was archived
Returns: A string that combines the two arguments into a structure that can be reversed back into Python
Return type: str

Example usage:

>>> import storytracker
>>> from datetime import datetime
>>> storytracker.create_archive_filename("http://www.latimes.com", datetime.now())
'http!www.latimes.com!!!!@2014-07-06T16:31:57.697250'
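
Judging from that output, the convention appears to join the six urlparse() components with “!” and append an “@” followed by an ISO-formatted timestamp. Here is a hedged sketch of that guess; the authoritative behavior lives in storytracker.create_archive_filename().

from datetime import datetime
try:
    from urlparse import urlparse  # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3

def sketch_archive_filename(url, timestamp):
    # Guess: scheme!netloc!path!params!query!fragment@ISO-8601 timestamp
    parts = urlparse(url)
    return "%s@%s" % ("!".join(parts), timestamp.isoformat())

sketch_archive_filename("http://www.latimes.com", datetime(2014, 7, 6, 16, 31, 57, 697250))
# 'http!www.latimes.com!!!!@2014-07-06T16:31:57.697250'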

open_archive_directory

Accepts a directory path and returns an ArchivedURLSet list filled with ArchivedURL objects corresponding to every archived file it finds.

storytracker.open_archive_directory(path)
Parameters:path (str) – The path to a directory containing archived files.
Returns:An ArchivedURLSet list
Return type:ArchivedURLSet

Example usage:

>>> import storytracker
>>> obj_list = storytracker.open_archive_directory('/home/ben/archive/')

open_archive_filepath

Accepts a file path and returns an ArchivedURL object

storytracker.open_archive_filepath(path)
Parameters:path (str) – The path to the archived file. Its file name must conform to the conventions of storytracker.create_archive_filename().
Returns:An ArchivedURL object
Return type:ArchivedURL
Raises ArchiveFileNameError:
 If the file’s name cannot be parsed using the conventions of storytracker.create_archive_filename().

Example usage:

>>> import storytracker
>>> obj = storytracker.open_archive_filepath('/home/ben/archive/http!www.latimes.com!!!!@2014-07-06T16:31:57.697250.gz')

open_wayback_machine_url

Accepts a URL from the Internet Archive’s Wayback Machine and returns an ArchivedURL object

storytracker.open_wayback_machine_url(url)
Parameters:url (str) – A URL from the Wayback Machine that links directly to an archive. An example is https://web.archive.org/web/20010911213814/http://www.cnn.com/.
Returns:An ArchivedURL object
Return type:ArchivedURL
Raises ArchiveFileNameError:
 If the file’s name cannot be parsed.

Example usage:

>>> import storytracker
>>> obj = storytracker.open_wayback_machine_url('https://web.archive.org/web/20010911213814/http://www.cnn.com/')

reverse_archive_filename

Accepts a filename created using the rules of storytracker.create_archive_filename() and converts it back into Python objects. Returns a tuple of the URL string and a timestamp. Do not include the file extension in the provided string.

storytracker.reverse_archive_filename(filename)
Parameters:filename (str) – A filename structured using the style of the storytracker.create_archive_filename() function
Returns:A tuple containing the URL of the archived page as a string and a datetime object of the archive’s timestamp
Return type:tuple

Example usage:

>>> import storytracker
>>> storytracker.reverse_archive_filename('http!www.latimes.com!!!!@2014-07-06T16:31:57.697250')
('http://www.latimes.com', datetime.datetime(2014, 7, 6, 16, 31, 57, 697250))

reverse_wayback_machine_url

Accepts a URL from the Internet Archive’s Wayback Machine and returns a tuple with the archived URL string and a timestamp.

storytracker.reverse_wayback_machine_url(url)
Parameters:url (str) – A URL from the Wayback Machine that links directly to an archive. An example is https://web.archive.org/web/20010911213814/http://www.cnn.com/.

Returns:A tuple containing the URL of the archived page as a string and a datetime object of the archive’s timestamp
Return type:tuple

Example usage:

>>> import storytracker
>>> storytracker.reverse_wayback_machine_url('https://web.archive.org/web/20010911213814/http://www.cnn.com/')
('http://www.cnn.com/', datetime.datetime(2001, 9, 11, 21, 38, 14))
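
The Wayback Machine URL encodes everything needed to rebuild that tuple. Here is a rough illustration of how such a URL could be unpacked, not the library’s actual implementation.

from datetime import datetime

def parse_wayback_url(url):
    # Strip the Wayback prefix, then split the 14-digit timestamp from the archived URL
    prefix = "https://web.archive.org/web/"
    stamp, archived_url = url[len(prefix):].split("/", 1)
    return archived_url, datetime.strptime(stamp, "%Y%m%d%H%M%S")

parse_wayback_url('https://web.archive.org/web/20010911213814/http://www.cnn.com/')
# ('http://www.cnn.com/', datetime.datetime(2001, 9, 11, 21, 38, 14))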

Command-line interfaces

storytracker-archive

Usage: storytracker-archive [URL]... [OPTIONS]

Archive the HTML from the provided URLs

Options:
  -h, --help            show this help message and exit
  -v, --do-not-verify   Skip verification that HTML is in the response's
                        content-type header
  -m, --do-not-minify   Skip minification of HTML response
  -e, --do-not-extend-urls
                        Do not extend relative urls discovered in the HTML
                        response
  -c, --do-not-compress
                        Skip compression of the HTML response
  -d OUTPUT_DIR, --output-dir=OUTPUT_DIR
                        Provide a directory for the archived data to be stored

Example usage:

# This will pipe out gzipped content of the page to stdout
$ storytracker-archive http://www.latimes.com

# You can save it to an automatically named file in a directory you provide
$ storytracker-archive http://www.latimes.com -d ./

# If you'd prefer to have the HTML without compression
$ storytracker-archive http://www.latimes.com -c

# Which of course can be piped into other commands like anything else
$ storytracker-archive http://www.latimes.com -cm | grep lakers

storytracker-get

Usage: storytracker-get [URL]... [OPTIONS]

Retrieves HTML from the provided URLs

Options:
  -h, --help           show this help message and exit
  -v, --do-not-verify  Skip verification that HTML is in the response's
                       content-type header

Example usage:

# Download a URL like this
$ storytracker-get http://www.latimes.com

# Or two like this
$ storytracker-get http://www.latimes.com http://www.columbiamissourian.com

storytracker-links2csv

Usage: storytracker-links2csv [ARCHIVE PATHS OR DIRECTORIES]...

Extracts hyperlinks from archived files or streams and outputs them as comma-
delimited values

Options:
  -h, --help  show this help message and exit

Example usage:

# Extract from an archived file
$ storytracker-links2csv /path/to/my/directory/http!www.cnn.com!!!!@2014-07-22T04:18:21.751802+00:00.html

# Extract from a directory filled with archived files
$ storytracker-links2csv /path/to/my/directory/

Changelog

0.0.9

  • Created a new method to write out a visualization of the page as an image file.

0.0.8

  • Refactored analysis tools to use Selenium and PhantomJS rather than BeautifulSoup, which allowed a whole set of size and location attributes to be parsed from the fully rendered HTML document.

0.0.7

  • Added open_wayback_machine_url and reverse_wayback_machine_url functions to introduce support for files saved by the Internet Archive’s Wayback Machine.

0.0.6

  • is_story estimate added to each Hyperlink object as an attribute

0.0.5

  • Hyperlink and Image classes
  • hyperlink and images methods that extract them from ArchivedURL
  • write_hyperlinks_csv_to_file method on ArchivedURL for output
  • storytracker-links2csv command-line interface

0.0.4

  • Timestamping of archive method now includes timezone, set to UTC by default

0.0.3

  • More forgiving urlparse imports that work in both Python 2 and Python 3

0.0.2

  • Changed automatic file naming process to work better with long file names
  • Added basic logging to the archival functions

0.0.1

  • Python functions for retrieving and saving URLs
  • Command-line tools for interacting with those functions

Credits

This is a joint project of PastPages.org, The Reynolds Journalism Institute and the University of Missouri.

The lead developer is Ben Welsh.

Sponsors

Reynolds Journalism Institute
PastPages