Scrapple 0.3.0 documentation

The Internet is a vast source of information, and data from the web is used for a wide range of activities, such as research and analysis. Data extraction is a primary step in data mining and analysis, and extracting content from structured web pages is a vital task when the Internet is the principal source of data.

The current standards in web structure involve the use of CSS selectors or XPath expressions to select particular tags from which information can be extracted. Web pages are structured as element trees which can be parsed to traverse through the tags. This tree structure, which represents tags as parent/children/siblings, is very useful when tags should be represented in terms of the rest of the web page structure.

Scrapple is a project aimed at designing a framework for building web content extractors. Scrapple uses key-value based configuration files to define the parameters to be considered in generating the extractor. It considers the base page URL, selectors for the data to be extracted, and the selector for the links to be crawled through. At its core, Scrapple abstracts away the implementation of the extractor, focusing instead on representing the selectors for the required tags. Scrapple can be used to generate single page content extractors or link crawlers.

Overview

Web content extraction is a common task in the process of collecting data for data analysis. There are several existing frameworks that aid in this task. In this chapter, a brief introduction of Scrapple is provided, with instructions on setting up the development machine to run Scrapple.

Introducing Scrapple

The Internet is a huge source of information. Several people may use data from the Internet to perform various activities, like research or analysis. However, there are two primary issues involved with using data from the Internet :

  • You may not have any way to programmatically get information from a particular website, i.e., it may not provide an API for accessing the data.
  • Even if an API is provided, it may not expose all the data needed. Some data that is present on the web interface may not be available through the API.

This is where web scrapers and web crawlers come in.

Web scrapers & web crawlers

Web scrapers, also called extractors, are used to extract content from any particular page. They may use CSS Selectors or XPath expressions to point to a particular tag on the HTML structure of the page, and extract the content from that tag. The content extracted could be text from <div> tags, links from <a> tags, and so on.

Web crawlers are scripts that follow multiple links from a single base page. Given a base URL, a crawler uses that page as an index to many other pages, visiting each of them and extracting the required content along the way.

Scrapers and crawlers can be written to extract necessary content from any page that you would need information from.

How Scrapple helps

Scrapple helps to reduce the hassle in manually writing the scripts needed to extract the required content. It involves the use of a configuration file that specifies property-value pairs of various parameters involved in constructing the required script.

The configuration file is a JSON document consisting of the required key-value pairs. The user specifies the base URL of the page to work on and the tags of the data to be extracted, choosing between CSS selectors and XPath expressions to specify the target tags. Once the target tags have been specified, the remaining parameters are filled in to complete the configuration file. Scrapple uses this configuration file to generate the required script, and executing it produces the output of the scraper as a CSV/JSON document (depending on the argument passed while running the script). Thus, users can obtain the data they need without needing extensive programming expertise to manually write the required scripts.

The inspiration behind Scrapple

Scrapple is based on the best ideas involved in two projects :

  • Scrapy : Scrapy is an application framework, designed to build web spiders that extract structured web data.
  • Ducky : Ducky is a semi-automatic web wrapper, which uses a configuration file to define extraction rules, and extract data accordingly.

Review of existing systems

The process of data extraction from the web can be classified based on the selectors used, which can be CSS selectors or XPath expressions. CSS selectors are said to be faster and are used by many browsers. Ducky [1] uses CSS selectors to extract data from similarly structured pages.

On the other hand, XPath expressions are more reliable, handle text recognition better, and are a more powerful option for locating elements when compared to CSS selectors. There is ongoing research in this area. OXPath [2] provides an extension of XPath expressions. The system created by V. Crescenzi, P. Merialdo, and D. Qiu [3] uses XPath expressions for locating the training data used to create queries posed to the workers of a crowdsourcing platform.

Systems like Ducky and Deixto [4] use the concept of configuration files, where the user inputs simple details such as the base pages and a "next" column if there are multiple pages to be parsed. Deixto uses the concept of tag filtering, where unnecessary HTML tags can be ignored when the DOM (Document Object Model) tree is created.

Scrapy [5], an open source project, provides a framework for web crawlers and extractors. This framework provides support for spider programs that are manually written to extract data from the web. It uses XPath expressions to locate the content. The output formats of Ducky and Scrapy include XML, CSV and JSON files.

[1] Kei Kanaoka, Yotaro Fujii and Motomichi Toyama. Ducky: A Data Extraction System for Various Structured Web Documents. In Proceedings of the 18th International Database Engineering & Applications Symposium, IDEAS '14, pages 342-347, New York, NY, USA, 2014. ACM.
[2] T. Furche, G. Gottlob, G. Grasso, C. Schallhart, and A. Sellers. OXPath: A language for scalable data extraction, automation, and crawling on the deep web. The VLDB Journal, 22(1):47-72, Feb. 2013.
[3] V. Crescenzi, P. Merialdo, and D. Qiu. Alfred: Crowd assisted data extraction. In Proceedings of the 22nd International Conference on World Wide Web Companion, WWW '13 Companion, pages 297-300, Republic and Canton of Geneva, Switzerland, 2013. International World Wide Web Conferences Steering Committee.
[4] F. Kokkoras, K. Ntonas, and N. Bassiliades. Deixto: A web data extraction suite. In Proceedings of the 6th Balkan Conference in Informatics, BCI '13, pages 9-12, New York, NY, USA, 2013. ACM.
[5] Scrapy: A fast and powerful scraping and web crawling framework.

System requirements

Scrapple is a Python package/command line tool which runs on all operating systems that support Python. This includes :

  • Windows XP
  • Windows Vista
  • Windows 7
  • Windows 8.x
  • Common Linux distros : Ubuntu/Xubuntu/Lubuntu, Fedora, Mint, Gentoo, openSUSE, Arch Linux etc.
  • OS X

The basic requirements for running Scrapple are:

  • Python 2.7 or 3.x
  • pip or easy_install for installing the necessary Python packages [pip is the recommended choice]

Scrapple depends on a number of Python packages for various parts of its execution :

  • requests : The HTTP library. The requests library is used to make HTTP requests to load the required web pages.
  • lxml : The web scraping library. The lxml library is used to parse the element tree and extract the required content.
  • cssselect : The CSS selector library. cssselect works in tandem with lxml to handle CSS Selector expressions.
  • docopt : The command line parser. The docopt library is used to parse the command line interface input based on the CLI usage specification in the docstring.
  • Jinja2 : The templating engine. Jinja2 is used to create skeleton configuration file and generated Python scraper scripts.
  • Flask : The web micro-framework. The web interface to edit configuration files runs on Flask.
  • colorama : The output formatter. colorama is used to format the various sections of the command line output.

Install Scrapple

The System requirements section covers the requirements for running Scrapple in detail.

If you’re running Ubuntu, install the necessary C libraries for the lxml module.

$ sudo apt-get install libxml2-dev libxslt-dev python-dev lib32z1-dev

If you’re running any other Linux distro, follow the standard install procedures and install these libraries.

You may not have to do this on your Windows machine.

Install the requirements for running Scrapple with

$ pip install -r requirements.txt

If this fails because of the access privileges, run the command with sudo.

You can then install Scrapple with

$ pip install scrapple

or

$ sudo pip install scrapple

To verify that Scrapple has been installed correctly, try the scrapple command from the command line.

$ scrapple --version

This should display the version of Scrapple installed.


Concepts

Creating web content extractors requires a good understanding of the following topics : web page structure, selector expressions, and data formats.

In this chapter, a brief overview of these concepts behind Scrapple is given.

Web page structure

Structure of a Web page

The main elements that comprise a web page are :

  • DOCTYPE: This lets the browser know the type of markup language the page is written in.
  • Document Tree: We can consider a page as a document tree that contains any number of branches.
  • HTML: This is the root element of the document tree, and everything that follows is a child node. HTML has two descendants, HEAD and BODY.
  • HEAD: It contains the title and the metadata of the page.
  • BODY: It contains the data displayed by the page.

ElementTree

The Element type [6] is a data object that can contain tree-like data structures.

The ElementTree wrapper [6] type adds code to load web pages as trees of Element objects. An element consists of properties such as a tag (identifying the element type), a number of attributes, a text string holding the textual content, and a number of child nodes.

To create a tree, we create the root element and add child elements to it. The SubElement function can be used to create an element and add it to a parent element. A few of the methods provided to search for subelements are listed below, followed by a short sketch :

  • find(pattern) – Returns the first subelement matching the pattern
  • findtext(pattern) – Returns the value of the text attribute of the first subelement matching the pattern
  • findall(pattern) – Returns a list of all subelements matching the pattern
  • getiterator(tag) – Returns a list of all subelements matching the given tag
  • getiterator() – Returns a list of all the subelements
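
The following is a minimal sketch using Python's standard xml.etree.ElementTree module (whose Element API lxml also implements); the element names are purely illustrative.

import xml.etree.ElementTree as ET

# Create the root element and add child elements to it
root = ET.Element('library')
book = ET.SubElement(root, 'book')          # SubElement creates an element and adds it to the parent
title = ET.SubElement(book, 'title')
title.text = 'Python for Humans'

# Searching for subelements
print(root.find('book/title').text)         # first subelement matching the pattern
print(root.findtext('book/title'))          # text of the first subelement matching the pattern
print(root.findall('book'))                 # list of all subelements matching the pattern
for element in root.iter():                 # iterate over all subelements (the newer form of getiterator)
    print(element.tag)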

Selector expressions

Selector expressions are used to point to specific groups of tags on a web page. They involve using the tag type and tag attributes to select certain sections of the page.

There are two main types of selector expressions that are commonly used, and supported by Scrapple :

  • CSS selectors
  • XPath expressions

CSS Selectors

CSS selectors [7] offer flexible options to design efficient and dynamic tag queries, combining expressiveness with simplicity. They use tag names and tag attributes to represent the traversal through the DOM tree.

For example,

div#sidebar dl dd a refers to an anchor tag that is under a definition list in the <div> with the id ‘sidebar’.

In recent times, the use of CSS selectors has become more common because of the development of JavaScript libraries (like jQuery and React.js) which use CSS selectors to access particular tags in the web page structure.

CSS selectors are faster in many browsers. They rely on both the structure and the attributes of the page to find elements, providing a good balance between the two. Though property values cannot be specified as expressions, CSS selectors are widely used due to their simplicity and flexibility. They traverse down the DOM (Document Object Model) according to the path specified in the selector expression.
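
For instance, a minimal sketch of applying a CSS selector with the lxml and cssselect libraries (the same libraries Scrapple depends on) is shown below; the HTML fragment and selector are purely illustrative.

from lxml import html

# A small illustrative HTML fragment
page = html.fromstring("""
<div id="sidebar">
  <dl><dd><a href="/speaker/1">Jane Doe</a></dd></dl>
</div>
""")

# lxml uses cssselect internally to evaluate CSS selector expressions
for anchor in page.cssselect('div#sidebar dl dd a'):
    print(anchor.text, anchor.get('href'))    # -> Jane Doe /speaker/1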

A detailed study on CSS selectors can be found on W3C.

XPath Selectors

XPath [8] is used for navigating documents and selecting nodes. It is based on a tree representation [9] of the document to be traversed. Since it can traverse both up and down the DOM, it is widely used for easy navigation through the page to locate the elements searched for. Here are a few simple examples of XPath expressions and their meanings.

  • /html/body/p[1] : This finds the first p inside the body section of the html
  • //link : This selects all the link tags in the page
  • //div[@id="commands"] : This selects the div elements which contain the id=”commands”

XPath expressions can be simplified by looking for a path with unique attributes in the page. For instance, consider the expression

//*[@id="guide-container"]/div/ul/li[1]/ \
div/ul/li[@id="what_to_watch-guide-item"]/a

This can be simplified further as //*[@id="what_to_watch-guide-item"]/a based on the unique id=”what_to_watch-guide-item” attribute used.

They can also be simplified by shortening the reference path. For instance, consider the same expression as the previous example. This expression can be shortened as //*[@id="guide-container"]//li[@id="what_to_watch-guide-item"]/a where the ‘//’ refers to the shortened path.

There are several other advanced methods of representing XPath expressions, like expressing the ancestors or siblings, normalizing spaces etc.
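
As a minimal sketch with the lxml library (which Scrapple uses for parsing), XPath expressions can be evaluated against a parsed page as follows; the HTML fragment and expressions are purely illustrative.

from lxml import html

page = html.fromstring("""
<html><body>
  <p>First paragraph</p>
  <div id="commands"><a href="/help">Help</a></div>
</body></html>
""")

print(page.xpath('/html/body/p[1]/text()'))           # -> ['First paragraph']
print(page.xpath('//div[@id="commands"]/a/@href'))    # -> ['/help']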

A detailed study on XPath expressions can be found on W3C.

Data formats

There are several data formats that are used to handle data, including XML, CSV, JSON, etc. Scrapple provides support for storing extracted data in two formats :

  • Javascript Object Notation (JSON)
  • Comma Separated Values (CSV)

JSON

JavaScript Object Notation (JSON) files are easy to create, read, parse and write. JSON is a language-independent format, and hence many APIs use it as a data-interchange format.

A few of the data types in JSON are :

  • Object: It is an unordered set of name/value pairs, enclosed in curly braces.
  • Array: It is an ordered collection of values, enclosed in square brackets, with the values separated by commas.
  • Name: It is the field that describes the data.
  • Value: It is the input data for the name attribute. It can be a number, a Boolean value (true or false), a string (enclosed in double quotes), an array, or another object.

For example,

{
    "subject": "Computer Science",
    "data": [                            # Array
        {                                # Object
            "name": "John",              # String
            "marks": 96,                 # Number
            "passed": true               # Boolean
        },
        {
            "name": "Doe",
            "marks": 33,
            "passed": false
        }
    ]
}

CSV

Comma Separated Values (CSV) files consist of tabular data in which the fields are separated by commas and the records by line breaks. CSV is a plain-text format, and CSV files are easy to handle and manipulate.

For example,

Name   Marks   Grade   Promotion
John   96      O       True
Doe    45      F       False

can be represented as,

Name,Marks,Grade,Promotion
John,96,O,True
Doe,45,F,False
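
As a minimal sketch of how such records can be written in both formats with Python's standard json and csv modules (the file names are illustrative) :

import csv
import json

records = [
    {"Name": "John", "Marks": 96, "Grade": "O", "Promotion": True},
    {"Name": "Doe", "Marks": 45, "Grade": "F", "Promotion": False},
]

# JSON output : dump the records as a single document
with open('results.json', 'w') as f:
    json.dump({"data": records}, f, indent=4)

# CSV output : a header row followed by one line per record (Python 3)
with open('results.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=["Name", "Marks", "Grade", "Promotion"])
    writer.writeheader()
    writer.writerows(records)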

The Scrapple framework

This section deals with how Scrapple works - the architecture of the Scrapple framework, the commands and options provided by the framework and the specification of the configuration file.

Scrapple architecture

Scrapple provides a command line interface (CLI) to access a set of commands which can be used for implementing various types of web content extractors. The basic architecture of Scrapple, shown in the figure below, illustrates how the various components are related.

[Figure: Scrapple architecture]

  • Command line input
    The command line input is the basis of definition of the implementation of the extractor. It specifies the project configuration and the options related to implementing the extractor.
  • Configuration file
    The configuration file specifies the rules of the required extractor. It contains the selector expressions for the data to be extracted and the specification of the link crawler.
  • Extractor framework
    The extractor framework handles the implementation of the parsing & extraction. It performs the following steps (a simplified sketch is given after this list) :
    • It makes HTTP requests to fetch the web page to be parsed.
    • It parses through the element tree.
    • It extracts the required content, depending on the extractor rules in the configuration file.
    • In case of crawlers, this process is repeated for all the pages that the extractor crawls through.
  • Data format handler
    According to the options specified in the CLI input, the extracted content is stored as a CSV document or a JSON document.
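
As an illustration only (not the actual Scrapple source), the fetch-parse-extract cycle performed by the extractor framework can be sketched with the requests and lxml libraries that Scrapple depends on; the URL and selector rules here are hypothetical.

import requests
from lxml import html

def extract(url, rules):
    """Fetch a page and apply a dictionary of {field: XPath} extraction rules."""
    response = requests.get(url)                  # HTTP request to load the web page
    tree = html.fromstring(response.content)      # parse the element tree
    record = {}
    for field, xpath in rules.items():            # extract content per the extractor rules
        matches = tree.xpath(xpath)
        record[field] = matches[0].strip() if matches else ''
    return record

# Hypothetical usage
print(extract('http://pyvideo.org/video/1785/python-for-humans-1', {'title': '//h3/text()'}))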

Scrapple commands

The four commands provided by the Scrapple CLI are

  • genconfig
  • generate
  • run
  • web

genconfig

The genconfig command is used to create a skeleton configuration file, which can be used as the base for writing the full configuration file. This makes it easier to understand the structure of the key/value-based configuration file and to provide the necessary options.

The two positional arguments for the genconfig command are :

  • The project name
  • The base URL

The genconfig command creates a basic configuration file with the provided base URL, and creates it as “<project_name>.json”.

The three optional arguments for the genconfig command are :

  • The type of extractor
  • The selector type
  • Number of crawl levels

The extractor could be a scraper or a crawler, and this can be specified in the --type argument. By default, it is a scraper. When the crawler option is provided, it adds the “next” parameter in the skeleton configuration file.

With the --selector option, the selector type to be used can be specified. This can be “css” or “xpath”. By default, it is “xpath”.

For crawler configuration specifications, the number of levels of the crawler can be specified using the --levels argument. By default, it is 1. According to the number of levels specified, nested “next” fields are created.

Examples :

  • $ scrapple genconfig pyvideo http://pyvideo.org/category creates pyvideo.json, which contains the skeleton configuration file for a scraper which uses XPath expressions as the selector.
  • $ scrapple genconfig pyvideo http://pyvideo.org/category --type=crawler creates pyvideo.json, which contains the skeleton configuration file for a single level crawler which uses XPath expressions as the selector.
  • $ scrapple genconfig pyvideo http://pyvideo.org/category --type=crawler --selector=css --levels=2 creates pyvideo.json, which contains the skeleton configuration file for a 2 level crawler which uses CSS selector expressions as the selector.

generate

The generate command is used to generate the Python script corresponding to the specifications in the configuration file. This command is used to create the script that replicates the operation of the run command.

The two positional arguments for the generate command are :

  • The project name
  • The output file name

The project name is the name of the configuration file to be used, i.e, “<project_name>.json” is the configuration file used as the specification. The command creates “<output_file_name>.py” as the generated Python script.

The one available optional argument is :

  • The output type

This specifies the output format in which the extracted content is to be stored. This could be “csv” or “json”. By default, it is “json”.

Examples :

  • $ scrapple generate pyvideo talk1 generates talk1.py based on pyvideo.json, where the extracted data is stored in a JSON document.
  • $ scrapple generate pyvideo talk1 --output_type=csv generates talk1.py based on pyvideo.json, where the extracted data is stored in a CSV document.

run

The run command is used to run the extractor corresponding to the specifications in the configuration file. This command runs the extractors and stores the extracted content for later use.

The two positional arguments for the run command are :

  • The project name
  • The output file name

The project name is the name of the configuration file to be used, i.e, “<project_name>.json” is the configuration file used as the specification. The command creates “<output_file_name>.json” or “<output_file_name>.csv” which contains the extracted content.

The one available optional argument is :

  • The output type

This specifies the output format in which the extracted content is to be stored. This could be “csv” or “json”. By default, it is “json”.

Examples :

  • $ scrapple run pyvideo talk1 runs the extractor based on pyvideo.json, and stores the extracted content in talk1.json.
  • $ scrapple run pyvideo talk1 --output_type=csv runs the extractor based on pyvideo.json, and stores the extracted content in talk1.csv.

web

The web command is an added feature, to make it easier to edit the configuration file. It provides a web interface, which contains a form where the configuration file can be filled. It currently supports only editing configuration files for scrapers. Future work includes support for editing configuration files for link crawlers.

The web interface can be opened with the command

$ scrapple web

This starts a Flask web app, which opens on port 5000 on the localhost.

Configuration file

The configuration file is the basic specification of the extractor required. It contains the URL for the web page to be loaded, the selector expressions for the data to be extracted and in the case of crawlers, the selector expression for the links to be crawled through.

The keys used in the configuration file are :

  • project_name : Specifies the name of the project with which the configuration file is associated.
  • selector_type : Specifies the type of selector expressions used. This could be “xpath” or “css”.
  • scraping : Specifies parameters for the extractor to be created.
    • url : Specifies the URL of the base web page to be loaded.
    • data : Specifies a list of selectors for the data to be extracted.
      • selector : Specifies the selector expression.
      • attr : Specifies the attribute to be extracted from the result of the selector expression.
      • field : Specifies the field name under which this data is to be stored.
      • default : Specifies the default value to be used if the selector expression fails.
    • table : Specifies a description for scraping tabular data.
      • table_type : Specifies the type of table (“rows” or “columns”). This determines the type of table to be extracted. A row extraction is when there is a single row to be extracted and mapped to a set of headers. A column extraction is when a set of rows have to be extracted, giving a list of header-value mappings.
      • header : Specifies the headers to be used for the table. This can be a list of headers, or a selector that gives the list of headers.
      • prefix : Specifies a prefix to be added to each header.
      • suffix : Specifies a suffix to be added to each header.
      • selector : Specifies the selector for the data. For row extraction, this is a selector that gives the row to be extracted. For column extraction, this is a list of selectors for each column.
      • attr : Specifies the attribute to be extracted from the selected tag.
      • default : Specifies the default value to be used if the selector does not return any data.
    • next : Specifies the crawler implementation.
      • follow_link : Specifies the selector expression for the <a> tags to be crawled through.

The main objective of the configuration file is to specify extraction rules in terms of selector expressions and the attribute to be extracted. There are certain set forms of selector/attribute value pairs that perform various types of content extraction.

Selector expressions :

  • CSS selector or XPath expressions that specify the tag to be selected.
  • “url” to take the URL of the current page on which extraction is being performed.

Attribute selectors :

  • “text” to extract the textual content from that tag.
  • “href”, “src” etc., to extract any of the other attributes of the selected tag.
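
To illustrate how these keys fit together, a hypothetical configuration file for a single level crawler using XPath expressions could look like this (the URL, selectors and field names are made up for illustration) :

{
    "project_name": "example",
    "selector_type": "xpath",
    "scraping": {
        "url": "http://example.com/index",
        "data": [
            {
                "field": "heading",
                "selector": "//h1",
                "attr": "text",
                "default": ""
            }
        ],
        "next": [
            {
                "follow_link": "//div[@id='listing']//a",
                "scraping": {
                    "data": [
                        {
                            "field": "item_title",
                            "selector": "//h2",
                            "attr": "text",
                            "default": ""
                        }
                    ]
                }
            }
        ]
    }
}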

Implementation

This section deals with the implementation of the Scrapple framework. This includes an explanation of the classes involved in the framework, the interaction scenarios for each of the commands supported by Scrapple, and utility functions that form a part of the implementation of the extractor.

Scrapple implementation classes

There are two main categories of classes involved in the implementation of extractors on Scrapple :

Command classes define the execution of the commands provided by Scrapple.

Selector classes define the implementation of the extraction through the selector expressions.

The following class diagram can be used to represent the relationship between the various classes.

[Figure: Scrapple class diagram]

Interaction scenarios

The primary use cases in Scrapple are the execution of the commands provided by the framework. A general idea of the execution of these commands and the relation between the various modules of the framework can be understood through a study of the interaction scenarios for each of the commands.

Basic sequence diagrams can be drawn for the execution of each command. A more detailed explanation of the execution of the commands is provided in the commands implementation section.

Command line interface

Scrapple is primarily run through a command line interface. The CLI is used to execute the commands supported by Scrapple. For a description of the usage of the Scrapple CLI, the help option can be used.

$ scrapple --help

This presents the usage description and an explanation of the optional arguments provided by the commands.

Usage:

scrapple (-h | --help | --version)

scrapple genconfig <projectname> <url> [--type=<type>] [--selector=<selector>]

scrapple run <projectname> <output_filename> [--output_type=<output_type>]

scrapple generate <projectname> <output_filename> [--output_type=<output_type>]

scrapple web

Options:
-h, --help Show this help message and exit
--version Display the version of Scrapple
--type=<type>, -t <type>
 Specifies if the script generated is a page scraper or a crawler [default: scraper]
--selector=<selector>, -s <selector>
 Specifies if XPath expressions or CSS selectors are used [default: xpath]
--output_type=<output_type>, -o <output_type>
 Specifies if the generated output is stored as CSV or JSON [default: json]

The scrapple tool on the command line is included in the system path when Scrapple is installed. When the tool is run, the input on the CLI is parsed by the runCLI() function.

scrapple.cmd.runCLI()[source]

The starting point for the execution of the Scrapple command line tool.

runCLI uses the docstring as the usage description for the scrapple command. The class for the required command is selected by a dynamic dispatch, and the command is executed through the execute_command() method of the command class.

This functionality is implemented by the following code block -

from operator import itemgetter

from scrapple.utils.dynamicdispatch import get_command_class
from scrapple.utils.exceptions import handle_exceptions

handle_exceptions(args)  # validate the parsed docopt arguments
command_list = ['genconfig', 'run', 'generate', 'web']
select = itemgetter('genconfig', 'run', 'generate', 'web')
selectedCommand = command_list[select(args).index(True)]  # the command flag set to True
cmdClass = get_command_class(selectedCommand)
obj = cmdClass(args)
obj.execute_command()

Command classes

scrapple.commands.genconfig

class scrapple.commands.genconfig.GenconfigCommand(args)[source]

Defines the execution of genconfig

execute_command()[source]

The genconfig command depends on predefined Jinja2 templates for the skeleton configuration files. Taking the --type argument from the CLI input, the corresponding template file is used.

Settings for the configuration file, like project name, selector type and URL are taken from the CLI input and using these as parameters, the template is rendered. This rendered JSON document is saved as <project_name>.json.

scrapple.commands.generate

class scrapple.commands.generate.GenerateCommand(args)[source]

Defines the execution of generate

execute_command()[source]

The generate command uses Jinja2 templates to create Python scripts, according to the specification in the configuration file. The predefined templates use the extract_content() method of the selector classes to implement linear extractors, and use nested for loops to implement multiple levels of link crawlers. This implementation is effectively a representation of the traverse_next() utility function, using the loop depth to differentiate between levels of the crawler execution.

According to the --output_type argument in the CLI input, the results are written into a JSON document or a CSV document.

The Python script is written into <output_filename>.py - running this file is the equivalent of using the Scrapple run command.
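
To give a rough idea of the generated crawler structure (an illustration only, not the exact template output; the URL and selectors are hypothetical), the nested loops follow a pattern along these lines :

from scrapple.selectors.xpath import XpathSelector

results = {'project': 'example', 'data': []}

# Level-0 page (hypothetical base URL and selectors)
page0 = XpathSelector("http://example.com/index")
r0 = dict()
r0["heading"] = page0.extract_content("//h1", "text", "")

# Level-1 crawl : one nested loop per level of the "next" specification
for page1 in page0.extract_links("//div[@id='listing']//a"):
    r1 = r0.copy()
    r1["item_title"] = page1.extract_content("//h2", "text", "")
    results['data'].append(r1)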

scrapple.commands.run

class scrapple.commands.run.RunCommand(args)[source]

Defines the execution of run

execute_command()[source]

The run command implements the web content extractor corresponding to the given configuration file.

The execute_command() validates the input project name and opens the JSON configuration file. The run() method handles the execution of the extractor run.

The extractor implementation follows these primary steps :

  1. Select the appropriate selector class through dynamic dispatch, based on the selector_type parameter in the configuration file.
  2. Iterate through the data section in level-0 of the configuration file. For each data item, call the extract_content() method of the selector class to extract the content according to the specified extractor rule.
  3. If there are multiple levels of the extractor, i.e., if there is a ‘next’ attribute in the configuration file, call the traverse_next() utility function and parse through successive levels of the configuration file.
  4. According to the --output_type argument, save the result data in a JSON document or a CSV document.

scrapple.commands.web

class scrapple.commands.web.WebCommand(args)[source]

Defines the execution of web

execute_command()[source]

The web command runs the Scrapple web interface through a simple Flask app.

When the execute_command() method is called from the runCLI() function, it starts off two simultaneous processes :

  • Calls the run_flask() method to start the Flask app on port 5000 of localhost
  • Opens the web interface on a web browser

The ‘/’ view of the Flask app opens up the Scrapple web interface. This provides a basic form to fill in the required configuration file. On submitting the form, a POST request is made, passing the form data in the request. This form is passed to the form_to_json() utility function, where it is converted into the resultant JSON configuration file.

Currently, closing the web command execution requires making a keyboard interrupt on the command line after the web interface has been closed.
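
As an illustration only (the route, template name and responses here are hypothetical, not the actual Scrapple source), such a view could be wired up along these lines :

from flask import Flask, render_template, request

from scrapple.utils.form import form_to_json

app = Flask(__name__)

@app.route('/', methods=['GET', 'POST'])
def index():
    # On a GET request, render the configuration form;
    # on a POST request, convert the submitted form into a JSON config file
    if request.method == 'POST':
        form_to_json(request.form)
        return 'Configuration file created'
    return render_template('index.html')

if __name__ == '__main__':
    app.run(port=5000)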

Selector classes

Selectors are used to specifically point to certain tags on a web page, from which content has to be extracted. In Scrapple, selectors are implemented through selector classes, which define methods to extract necessary content through specified selector expressions and to extract links from anchor tags to be crawled through.

There are two selector types that are supported in Scrapple :

  • XPath expressions
  • CSS selector expressions

These selector types are implemented through the XpathSelector and CssSelector classes, respectively. These two classes use the Selector class as their super class.

In the super class, the URL of the web page to be loaded is validated - ensuring the schema has been specified, and that the URL is valid. A HTTP GET request is made to load the web page, and the HTML content of this fetched web page is used to generate the element tree. This is the element tree that will be parsed to extract the necessary content.

scrapple.selectors.selector

class scrapple.selectors.selector.Selector(url)[source]

This class defines the basic Selector object.

extract_columns(result={}, selector='', table_headers=[], attr='', connector='', default='', verbosity=0, *args, **kwargs)[source]

Column data extraction for extract_tabular

extract_content(selector='', attr='', default='', connector='', *args, **kwargs)[source]

Method for performing the content extraction for the particular selector type. If the selector is “url”, the URL of the current web page is returned. Otherwise, the selector expression is used to extract content. The particular attribute to be extracted (“text”, “href”, etc.) is specified in the method arguments, and this is used to extract the required content. If the content extracted is a link (from an attr value of “href” or “src”), the URL is parsed to convert the relative path into an absolute path.

If the selector does not fetch any content, the default value is returned. If no default value is specified, an exception is raised.

Parameters:
  • selector – The XPath expression
  • attr – The attribute to be extracted from the selected tag
  • default – The default value to be used if the selector does not return any data
  • connector – String connector for list of data returned for a particular selector
Returns:

The extracted content

extract_links() is the method for performing the link extraction for the crawler. The selector passed as the argument is a selector pointing to the anchor tags that the crawler should pass through. A list of links is obtained, and the links are iterated through. The relative paths are converted into absolute paths, and an XpathSelector/CssSelector object (as the case may be) is created with the URL of the next page as the argument; this created object is yielded.

The extract_links method basically generates XpathSelector/CssSelector objects for all of the links to be crawled through.

Parameters: selector – The selector for the anchor tags to be crawled through
Returns: An XpathSelector/CssSelector object for every page to be crawled through
extract_rows(result={}, selector='', table_headers=[], attr='', connector='', default='', verbosity=0, *args, **kwargs)[source]

Row data extraction for extract_tabular

extract_tabular(header='', prefix='', suffix='', table_type='', *args, **kwargs)[source]

Method for performing the tabular data extraction.

Parameters:
  • result – A dictionary containing the extracted data so far
  • table_type – Can be "rows" or "columns". This determines the type of table to be extracted. A row extraction is when there is a single row to be extracted and mapped to a set of headers. A column extraction is when a set of rows have to be extracted, giving a list of header-value mappings.
  • header – The headers to be used for the table. This can be a list of headers, or a selector that gives the list of headers
  • prefix – A prefix to be added to each header
  • suffix – A suffix to be added to each header
  • selector – For row extraction, this is a selector that gives the row to be extracted. For column extraction, this is a list of selectors for each column.
  • attr – The attribute to be extracted from the selected tag
  • default – The default value to be used if the selector does not return any data
  • verbosity – The verbosity set as the argument for scrapple run
Returns:

A 2-tuple containing the list of all the column headers extracted and the list of dictionaries which contain (header, content) pairs

scrapple.selectors.xpath

class scrapple.selectors.xpath.XpathSelector(url)[source]

The XpathSelector object defines XPath expressions.

scrapple.selectors.css

class scrapple.selectors.css.CssSelector(url)[source]

The CssSelector object defines CSS selector expressions.
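
As a minimal usage sketch of the selector classes documented above (the URL and expression are illustrative, and assume the page is reachable) :

from scrapple.selectors.xpath import XpathSelector

# Load the page and build its element tree
page = XpathSelector("http://pyvideo.org/video/1785/python-for-humans-1")

# Extract the text content of the first matching tag, with an empty string as the default
title = page.extract_content(selector="//h3", attr="text", default="")
print(title)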

Utility functions

scrapple.utils.dynamicdispatch

Functions related to dynamic dispatch of objects

scrapple.utils.dynamicdispatch.get_command_class(command)[source]

Called from runCLI() to select the command class for the selected command.

Parameters:command – The command to be implemented
Returns:The command class corresponding to the selected command

This function is implemented through this simple code block :

from scrapple.commands import genconfig, generate, run, web
# eval(command) returns the imported command module; its corresponding
# <Command>Command class is then fetched by name
cmdClass = getattr(eval(command), command.title() + 'Command')
return cmdClass

scrapple.utils.exceptions

Functions related to handling exceptions in the input arguments

The handle_exceptions() function uses regular expressions to validate the CLI input.

projectname_re = re.compile(r'[^a-zA-Z0-9_]')
if args['genconfig']:
        if args['--type'] not in ['scraper', 'crawler']:
                raise Exception("--type has to be 'scraper' or 'crawler'")
        if args['--selector'] not in ['xpath', 'css']:
                raise Exception("--selector has to be 'xpath' or 'css'")
if args['generate'] or args['run']:
        if args['--output_type'] not in ['json', 'csv']:
                raise Exception("--output_type has to be 'json' or 'csv'")
if args['genconfig'] or args['generate'] or args['run']:
        if projectname_re.search(args['<projectname>']) is not None:
                raise Exception("Invalid <projectname>")
return

scrapple.utils.config

Functions related to traversing the configuration file

scrapple.utils.config.traverse_next(page, nextx, results, tabular_data_headers=[], verbosity=0)[source]

Recursive generator to traverse through the next attribute and crawl through the links to be followed.

Parameters:
  • page – The current page being parsed
  • next – The next attribute of the current scraping dict
  • results – The current extracted content, stored in a dict
Returns:

The extracted content, through a generator

In the case of crawlers, the configuration file can be treated as a tree, with the anchor tag links extracted from the follow link selector as the child nodes. This level-wise representation of the crawler configuration file provides a clear picture of how the file should be parsed.

[Figure: Tree representation of the crawler configuration file]

This recursive generator performs a depth-first traversal of the config file tree. It can be implemented through this code snippet :

for link in page.extract_links(next['follow_link']):
        r = results.copy()
        for attribute in next['scraping'].get('data'):
                if attribute['field'] != "":
                        r[attribute['field']] = \
                        link.extract_content(attribute['selector'],
                        attribute['attr'],
                        attribute['default'])
        if not next['scraping'].get('next'):
                yield r
        else:
                for next2 in next['scraping'].get('next'):
                        for result in traverse_next(link, next2, r):
                                yield result
scrapple.utils.config.get_fields(config)[source]

Recursive generator that yields the field names in the config file

Parameters:config – The configuration file that contains the specification of the extractor
Returns:The field names in the config file, through a generator

get_fields() parses the configuration file through a recursive generator, yielding the field names encountered.

for data in config['scraping']['data']:
        if data['field'] != '':
                yield data['field']
if 'next' in config['scraping']:
        for n in config['scraping']['next']:
                for f in get_fields(n):
                        yield f
scrapple.utils.config.extract_fieldnames(config)[source]

Function to return a list of unique field names from the config file

Parameters:config – The configuration file that contains the specification of the extractor
Returns:A list of field names from the config file

The extract_fieldnames() function uses the get_fields() generator, and handles cases like multiple occurrences of the same field name.

fields = []
for x in get_fields(config):
        if x in fields:
                fields.append(x + '_' + str(fields.count(x) + 1))
        else:
                fields.append(x)
return fields

scrapple.utils.form

Functions related to form handling.

scrapple.utils.form.form_to_json(form)[source]

Takes the form from the POST request in the web interface, and generates the JSON config file

Parameters:form – The form from the POST request
Returns:None

The web form is structured in such a way that all the data fields are numbered sequentially. This makes it easier to process the form while converting it into a JSON document.

for i in itertools.count(start=1):
        try:
                data = {
                        'field': form['field_' + str(i)],
                        'selector': form['selector_' + str(i)],
                        'attr': form['attribute_' + str(i)],
                        'default': form['default_' + str(i)]
                }
                config['scraping']['data'].append(data)
        except KeyError:
                break

Experimentation & Results

In this section, some experiments with Scrapple are provided. There are two main types of tools that can be implemented with the Scrapple framework : single page linear scrapers and link crawlers.

Once you’ve installed Scrapple, you can see the list of available commands and the related options using the command

$ scrapple --help

The configuration file is the backbone of Scrapple. It specifies the base page URL, selectors for the data extraction, the follow link for the link crawler and several other parameters.

Examples for each type are given.

Single page linear scrapers

For this example, we will extract content from a pyvideo page. Let us consider the page http://pyvideo.org/video/1785/python-for-humans-1.

To generate a skeleton configuration file, use the genconfig command. The primary arguments for the command are the project name and the URL of the base page.

$ scrapple genconfig pyvideo \
> http://pyvideo.org/video/1785/python-for-humans-1

This will create pyvideo.json which will initially look like this -

{

    "scraping": {
        "url": "http://pyvideo.org/video/1785/python-for-humans-1",
        "data": [
            {
                "field": "",
                "attr": "",
                "selector": "",
                "default": ""
            }
        ]
    },
    "project_name": "pyvideo",
    "selector_type": "xpath"

}

You can edit this json file to specify selectors for the various data that you would want to extract from the given page.

For example,

{

    "scraping": {
        "url": "http://pyvideo.org/video/1785/python-for-humans-1",
        "data": [
            {
                "field": "title",
                "attr": "text",
                "selector": "//h3",
                "default": ""
            },
            {
                "field": "speaker_name",
                "attr": "text",
                "selector": "//div[@id='sidebar']//dd[2]//a",
                "default": ""
            },
            {
                "field": "speaker_link",
                "attr": "href",
                "selector": "//div[@id='sidebar']//dd[2]//a",
                "default": ""
            },
            {
                "field": "event_name",
                "attr": "text",
                "selector": "//div[@id='sidebar']//dd[1]//a",
                "default": ""
            },
            {
                "field": "event_link",
                "attr": "href",
                "selector": "//div[@id='sidebar']//dd[1]//a",
                "default": ""
            }
        ]
    },
    "project_name": "pyvideo",
    "selector_type": "xpath"

}

Using this configuration file, you could generate a Python script using scrapple generate or directly run the scraper using scrapple run.

The generate and run commands take two positional arguments - the project name and the output file name.

To generate the Python script -

$ scrapple generate pyvideo talk1

This will create talk1.py, which is the script that can be run to replicate the action of scrapple run.

from __future__ import print_function
import json
import os

from scrapple.selectors.xpath import XpathSelector


def task_pyvideo():
        """
        Script generated using
        `Scrapple <http://scrappleapp.github.io/scrapple>`_
        """
        results = dict()
        results['project'] = "pyvideo"
        results['data'] = list()
        try:
                r0 = dict()
                page0 = XpathSelector(
                "http://pyvideo.org/video/1785/python-for-humans-1"
                )
                r0["title"] = page0.extract_content(
                "//h3", "text", ""
                )
                r0["speaker_name"] = page0.extract_content(
                "//div[@id='sidebar']//dd[2]//a", "text", ""
                )
                r0["speaker_link"] = page0.extract_content(
                "//div[@id='sidebar']//dd[2]//a", "href", ""
                )
                r0["event_name"] = page0.extract_content(
                "//div[@id='sidebar']//dd[1]//a", "text", ""
                )
                r0["event_link"] = page0.extract_content(
                "//div[@id='sidebar']//dd[1]//a", "href", ""
                )
                results['data'].append(r0)
        except KeyboardInterrupt:
                pass
        except Exception as e:
                print(e)
        finally:
                with open(os.path.join(os.getcwd(), 'talk1.json'), 'w') as f:
                        json.dump(results, f)


if __name__ == '__main__':
        task_pyvideo()

To run the scraper -

$ scrapple run pyvideo talk1

This will create talk1.json, which contains the extracted information.

{

    "project": "test1",
    "data": [
        {
            "event_name": "PyCon US 2013",
            "event_link": "/category/33/pycon-us-2013",
            "speaker_link": "/speaker/726/kenneth-reitz",
            "speaker_name": "Kenneth Reitz",
            "title": "Python for Humans"
        }
    ]

}

Comparison between Scrapple & Ducky

From the experiments performed with the Scrapple framework, we see that correctly written configuration files give accurate results. In the single page linear extractor example and link crawler example (where over 2800 pages were crawled through), an accuracy level of 100% was achieved.

The accuracy of the implementation of the Scrapple framework is dependent on the user’s understanding of web structure and the ability to write correct selector expressions.

On comparison with Ducky [1], it can be seen that Ducky also provides an accuracy of 100%. The primary difference between the Scrapple framework and the Ducky framework is the features provided.

Feature                          Scrapple   Ducky
Configuration file               YES        YES
CSS selectors                    YES        YES
XPath selectors                  YES        NO
CSV output                       YES        YES
JSON output                      YES        YES
XML output                       NO         YES
Generation of extractor script   YES        NO

Contributing to Scrapple

Scrapple is on GitHub, at https://github.com/scrappleapp/scrapple.

Authors

Scrapple is written and maintained by the following contributors :

History

0.3.0 - 2016-09-23

  • Set up table scraping parameters and execution
  • Fix json configuration generation

0.2.6 - 2015-11-27

  • Edit requirements

0.2.5 - 2015-05-28

  • Add levels argument for genconfig command, to create crawler config files for a specific depth

0.2.4 - 2015-04-13

  • Update documentation
  • Minor fixes

0.2.3 - 2015-03-11

  • Include implementation to use csv as the output format

0.2.2 - 2015-02-22

  • Fix bug in generate script template

0.2.1 - 2015-02-21

  • Update tests

0.2.0 - 2015-02-20

  • Include implementation for scrapple run and scrapple generate for crawlers
  • Modify web interface for editing scraper config files
  • Revise skeleton configuration files

0.1.1 - 2015-02-10

  • Release on PyPI with revisions
  • Include web interface for editing scraper config files
  • Modified implementations of certain functions

0.1.0 - 2015-02-04

  • First release on PyPI

Contributing

Contributions are welcome, and they are greatly appreciated! Every little bit helps, and credit will always be given.

You can contribute in many ways:

Types of Contributions

Report Bugs

Report bugs at https://github.com/scrappleapp/scrapple/issues.

If you are reporting a bug, please include:

  • Your operating system name and version.
  • Any details about your local setup that might be helpful in troubleshooting.
  • Detailed steps to reproduce the bug.
Fix Bugs

Look through the GitHub issues for bugs. Anything tagged with “bug” is open to whoever wants to fix it.

Implement Features

Look through the GitHub issues for features. Anything tagged with “feature” is open to whoever wants to implement it.

Write Documentation

Scrapple could always use more documentation, whether as part of the official Scrapple docs, in docstrings, or even on the web in blog posts, articles, and such.

Submit Feedback

The best way to send feedback is to file an issue at https://github.com/scrappleapp/scrapple/issues.

If you are proposing a feature:

  • Explain in detail how it would work.
  • Keep the scope as narrow as possible, to make it easier to implement.
  • Remember that this is a volunteer-driven project, and that contributions are welcome :)

Get Started!

Ready to contribute? Here’s how to set up scrapple for local development.

  1. Fork the scrapple repo on GitHub.

  2. Clone your fork locally:

    $ git clone git@github.com:your_name_here/scrapple.git
    
  3. Install your local copy into a virtualenv. Assuming you have virtualenvwrapper installed, this is how you set up your fork for local development:

    $ mkvirtualenv scrapple
    $ cd scrapple/
    $ python setup.py develop
    
  4. Create a branch for local development:

    $ git checkout -b name-of-your-bugfix-or-feature
    

    Now you can make your changes locally.

  5. When you’re done making changes, check that your changes pass nosetests:

    $ cd tests && nosetests -v
    
  6. Commit your changes and push your branch to GitHub:

    $ git add .
    $ git commit -m "Your detailed description of your changes."
    $ git push origin name-of-your-bugfix-or-feature
    
  7. Submit a pull request through the GitHub website.

Pull Request Guidelines

Before you submit a pull request, check that it meets these guidelines:

  1. The pull request should include tests.
  2. If the pull request adds functionality, the docs should be updated. Put your new functionality into a function with a docstring, and add the feature to the list in README.rst.
  3. The pull request should work for Python 2.6, 2.7, 3.3, and 3.4, and for PyPy. Check https://travis-ci.org/scrappleapp/scrapple/pull_requests and make sure that the tests pass for all supported Python versions.

The goal of Scrapple is to provide a generalized solution to the problem of web content extraction. This framework requires a basic understanding of web page structure, which is needed to write the required selector expressions. Using these selector expressions, the required web content extractors can be implemented to generate the desired datasets.

Experimentation with a wide range of websites gave consistently accurate results, in terms of the generated dataset. However, larger crawl jobs took a long time to complete, and the execution had to be run in one stretch. Scrapple could be improved to provide restartable crawlers, using caching mechanisms to keep track of the position in the URL frontier. Tag recommendation systems could also be implemented, using complex learning algorithms, though there would be a trade-off on accuracy.