Diffbot PHP Client Documentation

This documentation is intended for Diffbot’s PHP client - for general Diffbot docs, see Diffbot Documentation.

If this is your first time encountering the PHP client, it’s recommended you read the overview.

Contents:

Overview

Diffbot

Diffbot is a visual machine learning AI which processes renders of web pages to generate structured JSON entities.

In other words, you give Diffbot a URL and it returns human-readable data about that page. It doesn’t rely on what it finds in the source code; rather, it reads the rendered page the way a human does, visually extracting the human-directed content to provide reliable information about what’s actually being said on the page. As a result, it is relatively untrickable by over-optimized SEO meta content.

Diffbot exposes its services via a set of API endpoints.

To read more about Diffbot, see the official documentation.

Diffbot PHP Client

The Diffbot PHP Client is the official PHP wrapper for the API endpoints Diffbot provides.

By using the PHP client, the developer can interact with both the APIs and the returned entities in an object-oriented manner, rather than parsing JSON and extracting data manually. The PHP client uses Guzzle to issue requests to the API. It is currently built on top of Guzzle 5, and there are no immediate plans to transition to the feature-lacking version 6.

Quickstart

Install via Composer:

composer require swader/diffbot-php-client

Create a Diffbot instance, provide a token, specify the URL you want to process, and use all this to create an instance of the API endpoint:

$diffbot = new Diffbot('my_token');
$url = 'http://www.sitepoint.com/diffbot-crawling-visual-machine-learning/';
$articleApi = $diffbot->createArticleApi($url);

Configure the API call with some setters (all will be explained in this documentation) and issue the call:

$processedArticle = $articleApi->setDiscussion(false)->call();

Consume the resulting data entity any way you see fit:

echo $processedArticle->author; // Bruno Skvorc

Diffbot Class

The Diffbot class is the first instance a developer must create when using the client. It serves as a container for global settings, and as a factory for the various API endpoint classes.

class Swader\Diffbot\Diffbot

The Diffbot class takes a single optional argument, the $token, which can be obtained here. Instantiate like so:

$diffbot = new Diffbot("my_token");

Alternatively, set the token globally, and instantiate without passing in the parameter:

Diffbot::setToken("my_token");
$diffbot = new Diffbot();

Note that if you instantiate without a global token set, and don’t pass in a token while instantiating either, you’ll get a Swader\Diffbot\Exceptions\DiffbotException thrown.

static Swader\Diffbot\Diffbot::setToken($token)
Parameters:
  • $token (string) – The token.
Returns:

void, or throws an \InvalidArgumentException if the token is invalid

Useful for setting a default token for all future instances.

Usage:

Diffbot::setToken("my_token");

Swader\Diffbot\Diffbot::getToken()
Returns:null or string

Returns either the instance token, or the globally defined one - or null if neither is defined

Usage:

echo $diffbot->getToken(); // "my_token"
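
The fallback between the instance token and the global token can be pictured with a minimal sketch. TokenHolder is a hypothetical stand-in written for this page, not the client's source; the empty-string check and the RuntimeException are assumptions (the client throws its own DiffbotException).

```php
<?php

// Hypothetical stand-in for the token handling described above;
// it is not the client's actual source code.
class TokenHolder
{
    /** @var string|null Default token shared by all instances. */
    private static $globalToken = null;

    /** @var string|null Token specific to this instance. */
    private $instanceToken = null;

    public static function setToken(string $token): void
    {
        // The real validation rules are not documented here; rejecting
        // empty strings is an assumption made for this sketch.
        if (trim($token) === '') {
            throw new InvalidArgumentException('Token must not be empty.');
        }
        self::$globalToken = $token;
    }

    public function __construct(?string $token = null)
    {
        if ($token === null && self::$globalToken === null) {
            // The client throws its own DiffbotException at this point.
            throw new RuntimeException('No token provided.');
        }
        $this->instanceToken = $token;
    }

    public function getToken(): ?string
    {
        // Prefer the instance token, fall back to the global one.
        return $this->instanceToken ?? self::$globalToken;
    }
}

TokenHolder::setToken('global_token');
echo (new TokenHolder())->getToken(), PHP_EOL;           // global_token
echo (new TokenHolder('my_token'))->getToken(), PHP_EOL; // my_token
```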

Swader\Diffbot\Diffbot::setHttpClient(GuzzleHttp\Client $client)
Parameters:
  • $client (GuzzleHttp\Client) – The HTTP client.
Returns:

$this

Allows changing the HTTP client used to send requests to the Diffbot API. Generally useful only during testing, but some edge cases may arise. This method does not need to be called for Diffbot to be usable - it will default to a new instance of the regular GuzzleHttp\Client.

Usage:

$client = new GuzzleHttp\Client();
$diffbot->setHttpClient($client);

Swader\Diffbot\Diffbot::getHttpClient()

Returns the currently set HTTP client. Can be changed via Swader\Diffbot\Diffbot::setHttpClient.

Returns:GuzzleHttp\Client

Swader\Diffbot\Diffbot::setEntityFactory($factory)
Parameters:
  • $factory (Swader\Diffbot\Interfaces\EntityFactory) – The entity factory.
Returns:

$this

Allows for changing the entity factory in use when returning and processing Diffbot-provided data. A custom Entity Factory might, for example, return Author entities (also custom) for all calls to a custom API set up in a user’s Diffbot account. This helps with getting fully consumable custom data right from the API source, rather than requiring additional processing.

If not explicitly set, defaults to built-in Swader\Diffbot\Factory\Entity.

Usage:

$newEntityFactory = new \My\Custom\EntityFactory();

$diffbot = new Diffbot('my_token');
$diffbot->setEntityFactory($newEntityFactory);

// @todo: Full tutorial about a custom Entity and EntityFactory

Swader\Diffbot\Diffbot::getEntityFactory()
Returns:Swader\Diffbot\Interfaces\EntityFactory

Returns the currently defined Swader\Diffbot\Interfaces\EntityFactory instance. This method generally isn’t needed outside of testing scenarios. See above for usage of the setter.

Swader\Diffbot\Diffbot::createProductApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Product

The product API turns web shops, catalogs, etc. into structured JSON (think eBay, Amazon...). This method creates an instance of the Swader\Diffbot\Api\Product class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). For a detailed directory of available methods and in-depth usage examples, see the Swader\Diffbot\Api\Product documentation.

Usage:

$api = $diffbot->createProductApi("http://www.amazon.com/Oh-The-Places-Youll-Go/dp/0679805273/");
$result = $api->call();

echo $result->offerPrice; // $11.99
echo $result->getIsbn(); // 0679805273

Swader\Diffbot\Diffbot::createArticleApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Article

The article API turns online news posts, blog articles, etc. into structured JSON. This method creates an instance of the Swader\Diffbot\Api\Article class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). For a detailed directory of available methods and in-depth usage examples, see the Swader\Diffbot\Api\Article documentation.

Usage:

$api = $diffbot->createArticleApi("http://techcrunch.com/2012/05/31/diffbot-raises-2-million-seed-round-for-web-content-extraction-technology/");
$result = $api->call();

echo $result->publisherCountry; // United States
echo $result->getAuthor(); // Sarah Perez

Swader\Diffbot\Diffbot::createImageApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Image

The image API finds images in a post and returns them as JSON. This method creates an instance of the Swader\Diffbot\Api\Image class. The method accepts a single string as a parameter: either a URL to process for images, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). For a detailed directory of available methods and in-depth usage examples, see the Swader\Diffbot\Api\Image documentation. Note that unlike Product and Article, the Image API can return several Image entities (see usage below). If not iterated through, the result refers to the first image only.

Usage:

$api = $diffbot->createImageApi("http://smittenkitchen.com/blog/2012/01/buckwheat-baby-with-salted-caramel-syrup/");
$result = $api->call();

echo $result->naturalHeight; // 333

foreach ($result as $image) {
    echo $image->title;
    echo $image->getXPath();
}

Swader\Diffbot\Diffbot::createAnalyzeApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Analyze

The analyze API tries to autodetect the content it’s dealing with (image, product, article...) and extracts it into structured JSON. This method creates an instance of the Swader\Diffbot\Api\Analyze class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). The Analyze API is the default API used during Swader\Diffbot\Diffbot::crawl mode.

Usage:

$api = $diffbot->createAnalyzeApi("http://techcrunch.com/2012/05/31/diffbot-raises-2-million-seed-round-for-web-content-extraction-technology/");
$result = $api->call();

echo $result->publisherCountry; // United States
echo $result->getAuthor(); // Sarah Perez

Swader\Diffbot\Diffbot::createDiscussionApi($url)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
Returns:

Swader\Diffbot\Api\Discussion

The discussion API turns online comments, forum topics or pages of reviews into structured JSON. Think Amazon review section, YouTube comments, article Disqus comments, etc. This method creates an instance of the Swader\Diffbot\Api\Discussion class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). Like the Image API above, this one also returns several Swader\Diffbot\Api\Discussion entities per call, if available, along with other data - see usage below.

Usage:

$api = $diffbot->createDiscussionApi("http://boards.straightdope.com/sdmb/showthread.php?t=740315");
$result = $api->call();

echo $result->numPosts; // 43
echo $result->getParticipants(); // 23

foreach ($result as $post) {
    echo $post->getAuthor();
    echo $post->votes;
}

Swader\Diffbot\Diffbot::createCustomApi($url, $name)
Parameters:
  • $url (string) – URL which is to be processed, or the word “crawl”
  • $name (string) – Name of the custom API as defined in the Diffbot UI
Returns:

Swader\Diffbot\Api\Custom

Diffbot customers can define Custom APIs. For a tutorial on doing this, see here. In short, you can tell Diffbot how to recognize certain areas of a web page and have it translate them into JSON for you if none of the standard APIs do the trick. This allows for much more lightweight and specific calls, resulting in a quicker turnaround and (usually) more precise data. This method creates an instance of the Swader\Diffbot\Api\Custom class. The method accepts two parameters: first, either a URL to process or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below); second, the name of the custom API to use. Unlike other APIs, this one has no specific entity to return and instead returns a Swader\Diffbot\Entity\Wildcard entity which matches anything.

Usage:

$api = $diffbot->createCustomApi("http://sitepoint.com/author/bskvorc", "AuthorFolio");
$result = $api->call();

echo $result->bio; // Bruno is a coder from Croatia with Master's Degrees in...

Swader\Diffbot\Diffbot::crawl($name = null, Swader\Diffbot\Api $api = null)
Parameters:
  • $name (string) – Name of the new crawljob. If omitted, activates read only mode and returns joint data about all defined crawljobs for the current Diffbot token.
  • $api (Swader\Diffbot\Api) – Instance of the API to process the crawled URLs. If omitted, defaults to Swader\Diffbot\Api\Analyze.
Returns:

Swader\Diffbot\Api\Crawl

The crawl method is used to create a new Crawlbot job (crawljob). To find out more about Crawlbot and what, how and why it does what it does, see here. I also recommend reading the Crawlbot API docs and the Crawlbot support topics so you can dive right in without being too confused by the code below.

In a nutshell, the Crawlbot crawls a set of seed URLs for links (even if a subdomain is passed to it as seed URL, it still looks through the entire main domain and all other subdomains it can find) and then processes all the pages it can find using the API you define (or opting for Analyze API by default). The result of the call is a collection of Swader\Diffbot\Entity\JobCrawl objects, each with details about a defined job. To actually get data obtained by crawling and processing, use the Swader\Diffbot\Diffbot::search API.

Here’s how you can create a crawljob (see detailed Swader\Diffbot\Api\Search for a step by step guide with explanations):

$url = 'crawl';
$articleApi = $diffbot->createArticleApi($url)->setDiscussion(false);

$crawl = $diffbot->crawl('mycrawl_01', $articleApi);

$crawl->setSeeds(['http://sitepoint.com']);

$job = $crawl->call();

// See JobCrawl class to find out which getters are available
var_dump($job->getDownloadUrl("json")); // outputs download URL to JSON dataset of the job's result

API Abstract

This page will describe the API Abstract class - the one which all the API classes extend to get some common functionality. Use this to build your own API class for custom APIs you defined in the Diffbot UI.

class Swader\Diffbot\Abstracts\Api

Swader\Diffbot\Abstracts\Api::__construct($url)
Parameters:
  • $url (string) – The URL of the page to process
Throws:

InvalidArgumentException if the URL is invalid AND not the word “crawl”

This class takes a single argument during construction, the URL of the page to process. Alternatively, the argument can be “crawl”, if the API is to be used in conjunction with the Swader\Diffbot\Api\Crawl API.

Swader\Diffbot\Abstracts\Api::setTimeout($timeout = 30000)
Parameters:
  • $timeout (int) – Optional. The timeout, in milliseconds. Defaults to 30,000, a.k.a. 30 seconds
Returns:

$this

Throws:

InvalidArgumentException if the timeout value is invalid (negative or not an integer)

Setting the timeout will define how long Diffbot will keep trying to fetch the API results. A timeout can happen for various reasons, from Diffbot’s failure, to the site being crawled being exceptionally slow, and more.

Usage:

$api->setTimeout(40000);

Swader\Diffbot\Abstracts\Api::call()
Returns:Swader\Diffbot\Entity\EntityIterator

The return value will be an iterable collection of appropriate entities. Refer to each API’s documentation for details on entities returned from each API call.

When the API instance has been fully configured, this method executes the call.

Usage:

$result = $api->call();
foreach ($result as $entity) { /* ... */ }

Swader\Diffbot\Abstracts\Api::buildUrl()
Returns:string

This method is called automatically when Swader\Diffbot\Abstracts\Api::call is called. It builds the URL which is to be called by the HTTPClient in Swader\Diffbot\Diffbot::setHttpClient, and returns it. This method can be used to get the URL for the purposes of testing in third party API clients like Postman.

Usage:

// ... set up $api ... //
$myUrl = $api->buildUrl();
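
To give a feel for the kind of string this produces, here is a minimal sketch of assembling a request URL from a token, a page URL and extra options. buildApiUrl is a hypothetical helper, and the endpoint and parameter names are assumptions rather than values taken from the client's source.

```php
<?php

// Hypothetical helper: the endpoint and parameter names are assumptions,
// not values taken from the client's source.
function buildApiUrl(string $endpoint, string $token, string $pageUrl, array $options = []): string
{
    $query = array_merge(['token' => $token, 'url' => $pageUrl], $options);

    // http_build_query() URL-encodes every value, including the page URL.
    return $endpoint . '?' . http_build_query($query);
}

$sketchUrl = buildApiUrl(
    'https://api.diffbot.com/v3/article',
    'my_token',
    'http://www.sitepoint.com/',
    ['timeout' => 40000]
);

echo $sketchUrl, PHP_EOL;
// https://api.diffbot.com/v3/article?token=my_token&url=http%3A%2F%2Fwww.sitepoint.com%2F&timeout=40000
```

A URL built this way can be pasted into a tool like Postman to inspect the raw response.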

Entity Abstract

This page will describe the Entity Abstract class. This class is the root of all Entity classes. Entity classes are used as containers for return values from various API endpoints. For example, the Article API will return an Article Entity, the Discussion API will return a Discussion Entity, and so on.

It is important to note that an API class will never return an Entity class directly. Rather, it will return a Swader\Diffbot\Entity\EntityIterator, an iterable container with all the Entities inside. The container, however, is configured in such a way that executing get methods on it directly will forward those calls to the first Entity in its dataset. See Swader\Diffbot\Entity\EntityIterator.
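
The forwarding behaviour can be sketched with two small stand-in classes. SketchEntity and SketchIterator are hypothetical names invented for this illustration; the real EntityIterator may be implemented quite differently, and the sketch does not handle an empty result set.

```php
<?php

// Hypothetical stand-ins; the real EntityIterator and entities differ.
class SketchEntity
{
    private $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function getTitle(): ?string
    {
        return $this->data['title'] ?? null;
    }
}

class SketchIterator implements IteratorAggregate
{
    private $entities;

    public function __construct(array $entities)
    {
        $this->entities = $entities;
    }

    public function getIterator(): Traversable
    {
        return new ArrayIterator($this->entities);
    }

    public function __call($name, $arguments)
    {
        // Forward any unknown method to the first entity in the set.
        return $this->entities[0]->$name(...$arguments);
    }
}

$result = new SketchIterator([
    new SketchEntity(['title' => 'First']),
    new SketchEntity(['title' => 'Second']),
]);

echo $result->getTitle(), PHP_EOL; // First

foreach ($result as $entity) {
    echo $entity->getTitle(), PHP_EOL; // First, then Second
}
```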

class Swader\Diffbot\Abstracts\Entity

Swader\Diffbot\Abstracts\Entity::__construct(array $data)

This class takes a single argument during construction, an array of data. This data is then turned into gettable information by means of getters, both direct and magic. Some getters do additional processing of the data in order to make it more useful to the user.

Parameters:
  • $data (array) – The data

Swader\Diffbot\Abstracts\Entity::getData()

Returns the raw data passed into the Entity by the parent API class. This will be an associative array (see Usage below).

Returns:array

Usage:

// ...

$data = $article->getData();

echo $data['title'];
echo $data['author'];

// etc.

Swader\Diffbot\Abstracts\Entity::__call()

Magic method for resolving undefined getters and only getters. If the method being called starts with get, the remainder of its name will be turned into a key to search inside the $data property (see getData). Once the call is identified as a getter call, __get is invoked (see below).

Returns:mixed
Throws:BadMethodCallException if the prefix of the method is not get

Swader\Diffbot\Abstracts\Entity::__get()

This method is called automatically when __call is called. It looks for the property being asked for inside the $data property of the current class, or returns null if not found.

Returns:mixed

Usage:

// ... API setup ... //
$result = $api->call();

echo $result->title; // resolved by __get
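
The interplay between __call and __get can be sketched as follows. MagicEntity is a hypothetical stand-in, and turning the method name into a key with lcfirst() is an assumption about how the remainder of the name is mapped.

```php
<?php

// Hypothetical stand-in for the abstract Entity's magic getters.
class MagicEntity
{
    protected $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function __call($name, $arguments)
    {
        if (strpos($name, 'get') !== 0) {
            throw new BadMethodCallException("No such method: $name");
        }
        // "getNumPages" becomes the key "numPages" (assumed mapping).
        $property = lcfirst(substr($name, 3));

        return $this->__get($property);
    }

    public function __get($property)
    {
        // Missing keys resolve to null rather than raising an error.
        return $this->data[$property] ?? null;
    }
}

$entity = new MagicEntity(['title' => 'Hello', 'numPages' => 7]);

echo $entity->getTitle(), PHP_EOL; // Hello
echo $entity->numPages, PHP_EOL;   // 7
var_dump($entity->getMissing());   // NULL
```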

Article API

This API is used to turn content like blog posts, news articles, and other prose into JSON.

For examples of data that might be returned, please see http://diffbot.com and run the Article API demo.

The Article API part of the Diffbot PHP client consists of two main classes: the API class, and the Article Entity class. We’ll describe them in order. Note that the API class extends Swader\Diffbot\Abstracts\Api, so be sure to read that first if you haven’t already.

Article API Class

class Swader\Diffbot\Api\Article

Basic Usage:

use Swader\Diffbot\Diffbot;

$url = 'http://some-article-to-process.com';

$diffbot = new Diffbot('my_token');
$api = $diffbot->createArticleApi($url);

Swader\Diffbot\Api\Article::setSentiment($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

This method sets the sentiment optional field value. This determines whether or not to return the sentiment score of the analyzed article text, a value ranging from -1.0 (very negative) to 1.0 (very positive). Sentiment analysis is powered by Semantria for advanced features like keyword and entity extraction, but the basic sentiment analysis (score only) is enabled for everyone, even those without Semantria accounts.

Usage:

$url = 'http://www.sitepoint.com/diffbot-crawling-visual-machine-learning/';

// ...

$api->setSentiment(true);
$result = $api->call();

// ...

echo $result->sentiment; // -0.0979

Swader\Diffbot\Api\Article::setPaging($bool = true)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

If set to false, Diffbot will not auto-concatenate several pages of a multi-page article into one. Defaults to true, max 20 pages.

For more info about auto-concatenation, see here.

While practical, this is a less reliable method of concatenating long posts than finding out the number of pages manually and processing them one by one. Not only does it often fail to recognize the next-page links, but if the series is longer than 20 parts, everything past page 20 is ignored. This is a limitation of Diffbot, not the client, and there’s little chance of it changing: concatenating more than 20 pages would likely trigger timeouts as the page count grows.

If you need to process multiple pages of something, it is thus recommended you find out those links yourself, then pass them into Article API one by one and concatenate later. If you’d like to analyze the entire concatenated post after the fact, it’s best to manually concat and then send the merged content into Diffbot as a POST value for processing.
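
The manual approach might look something like the sketch below. concatenatePages is a hypothetical helper and the example.com URLs are placeholders; in real use the callable would wrap an Article API call and return the text of each page, whereas here it is stubbed so the sketch runs offline.

```php
<?php

/**
 * Hypothetical helper: fetches each page with $fetchText and joins the results.
 * In real use, $fetchText would wrap an Article API call and return its text.
 */
function concatenatePages(array $urls, callable $fetchText): string
{
    $parts = [];
    foreach ($urls as $url) {
        $text = $fetchText($url);
        if ($text !== null && $text !== '') {
            $parts[] = $text; // skip pages that yielded no text
        }
    }

    return implode("\n\n", $parts);
}

// Stubbed fetcher so the sketch runs without network access.
$fakePages = [
    'http://example.com/article?page=1' => 'Part one.',
    'http://example.com/article?page=2' => 'Part two.',
];
$fetch = function (string $url) use ($fakePages) {
    return $fakePages[$url] ?? null;
};

$merged = concatenatePages(array_keys($fakePages), $fetch);
echo $merged, PHP_EOL;
```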

Usage:

$url = 'http://www.some-seven-part-article.com/';

// ...

$api->setPaging(true);
$result = $api->call();

// ...

echo $result->numPages; // 7

Swader\Diffbot\Api\Article::setMaxTags($max = 5)
Parameters:
  • $max (int) – The number of tags to generate and return
Returns:

$this

Sets the maximum number of automatically generated tags to return. By default, a maximum of five tags will be returned. Tags are a built-in feature of Diffbot, and could generate different results on two different calls to the same URL provided enough time has passed, due to Diffbot’s engine evolving over time as it processes more and more content.

For an example of what the tags might look like, run the demo example at https://diffbot.com or see Swader\Diffbot\Entity\Article::getTags.

Swader\Diffbot\Api\Article::setDiscussion($bool = true)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

Whether or not to use the Discussion API to additionally process any detected comment or review threads in the article. Behaves as if the Swader\Diffbot\Api\Discussion was set to process the page, and merges the returned data with the Article API’s results by means of a discussion field in the result. The field will have all the sub-fields of the usual Swader\Diffbot\Api\Discussion call; i.e. you will be able to access the Swader\Diffbot\Entity\Discussion entity and all its sub entities via the Swader\Diffbot\Entity\Article::getDiscussion method.

Article Entity Class

When the Article API is done processing an article (or several) the result will be an Article Entity (i.e. a collection containing one Article Entity inside an instance of Swader\Diffbot\Entity\EntityIterator).

For an overview of the abstract class all Entities build on, see Swader\Diffbot\Abstracts\Entity.

Note that the Article entity can also be returned by the Swader\Diffbot\Api\Analyze API in “article” mode, or in default mode when processing a URL that contains an article (auto-determined).

class Swader\Diffbot\Entity\Article

Swader\Diffbot\Entity\Article::__construct(array $data)
Parameters:
  • $data (array) – The data from which to build the Article entity

The Article entity’s constructor needs the data to populate its properties (see getters below). This class is automatically instantiated after a Swader\Diffbot\Api\Article or Swader\Diffbot\Api\Analyze call. You probably won’t ever need to manually create an instance of this class.

In the case of the Article entity, the constructor differs from the abstract one (Swader\Diffbot\Abstracts\Entity::__construct) in that it also looks for the discussion key in the result, in order to build a Swader\Diffbot\Entity\Discussion sub-entity (see Swader\Diffbot\Entity\Article::getDiscussion).

Swader\Diffbot\Entity\Article::getType()
Returns:string

Will always return “article” for articles:

// ... API setup ... //
$result = $api->call();

echo $result->getType(); // "article"

Swader\Diffbot\Entity\Article::getText()
Returns:string | null

Returns the plaintext content of the processed article. HTML tags are stripped completely, images are removed. If the text property is missing in the result, returns null.

Swader\Diffbot\Entity\Article::getHtml()
Returns:string | null

Returns the full HTML content of the article. If the HTML property is missing in the result, returns null.

Swader\Diffbot\Entity\Article::getDate()
Returns:string

Returns date as per RFC 2616. Example date: “Wed, 18 Dec 2013 00:00:00 GMT”. Note that this is strtotime friendly for further conversions.
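
As a small illustration of those further conversions, the sketch below parses the sample date from the sentence above with strtotime and with DateTimeImmutable. Only built-in PHP functions are used; the variable names are arbitrary.

```php
<?php

$date = 'Wed, 18 Dec 2013 00:00:00 GMT'; // sample value from the text above

$timestamp = strtotime($date);
if ($timestamp === false) {
    throw new RuntimeException("Unparseable date: $date");
}

// Format in UTC so the output does not depend on the server's time zone.
echo gmdate('Y-m-d', $timestamp), PHP_EOL; // 2013-12-18

// DateTimeImmutable accepts the same string and keeps the GMT offset.
$dt = new DateTimeImmutable($date);
echo $dt->format(DATE_ATOM), PHP_EOL; // 2013-12-18T00:00:00+00:00
```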

Swader\Diffbot\Entity\Article::getAuthor()
Returns:string | null

Returns the name of the author as written on the page. If Diffbot was unable to figure out who the author is, null is returned.

Swader\Diffbot\Entity\Article::getTags()
Returns:array

Returns an array of tags/entities, generated from analysis of the extracted text and cross-referenced with DBpedia and other data sources. Note that these are not the meta tags as defined by the author, but machine learned ones:

// ... API setup ... //

// URL: "http://www.sitepoint.com/diffbot-crawling-visual-machine-learning" //

$result = $api->call();

echo count($result->tags); // 5

var_dump($result->tags);

/** Output:
array (size=5)
  0 =>
    array (size=4)
      'count' => int 1
      'score' => float 0.62
      'label' => string 'Machine learning' (length=16)
      'uri' => string 'http://dbpedia.org/resource/Machine_learning' (length=44)
  1 =>
    array (size=4)
      'count' => int 4
      'score' => float 0.61
      'label' => string 'Web crawler' (length=11)
      'uri' => string 'http://dbpedia.org/resource/Web_crawler' (length=39)
  2 =>
    array (size=4)
      'count' => int 4
      'score' => float 0.59
      'label' => string 'Lexical analysis' (length=16)
      'uri' => string 'http://dbpedia.org/resource/Lexical_analysis' (length=44)
  3 =>
    array (size=4)
      'count' => int 7
      'score' => float 0.54
      'label' => string 'Uniform resource locator' (length=24)
      'uri' => string 'http://dbpedia.org/resource/Uniform_resource_locator' (length=52)
  4 =>
    array (size=5)
      'count' => int 2
      'score' => float 0.52
      'label' => string 'JavaScript' (length=10)
      'rdfTypes' =>
        array (size=3)
          0 => string 'http://dbpedia.org/ontology/ProgrammingLanguage' (length=47)
          1 => string 'http://dbpedia.org/ontology/Software' (length=36)
          2 => string 'http://dbpedia.org/ontology/Work' (length=32)
      'uri' => string 'http://dbpedia.org/resource/JavaScript' (length=38)
**/

Returns a maximum of 5 by default, though this can be changed in Swader\Diffbot\Api\Article::setMaxTags.
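
Since the tags come back as plain arrays, they can be post-processed with ordinary array functions. The sketch below uses sample data trimmed from the output above and keeps only the labels of tags above a score threshold; the threshold of 0.6 is an arbitrary value chosen for illustration.

```php
<?php

// Sample data trimmed from the output above.
$tags = [
    ['count' => 1, 'score' => 0.62, 'label' => 'Machine learning'],
    ['count' => 4, 'score' => 0.61, 'label' => 'Web crawler'],
    ['count' => 4, 'score' => 0.59, 'label' => 'Lexical analysis'],
    ['count' => 7, 'score' => 0.54, 'label' => 'Uniform resource locator'],
    ['count' => 2, 'score' => 0.52, 'label' => 'JavaScript'],
];

// Keep only tags at or above a (hypothetical) relevance threshold.
$threshold = 0.6;
$relevant = array_filter($tags, function (array $tag) use ($threshold) {
    return $tag['score'] >= $threshold;
});

// Reduce the remaining tags to their labels.
$labels = array_column($relevant, 'label');

echo implode(', ', $labels), PHP_EOL; // Machine learning, Web crawler
```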

Swader\Diffbot\Entity\Article::getNumPages()
Returns:int

Returns the number of pages if the article is a multi-page one. Read about auto-concatenation here and study the Swader\Diffbot\Api\Article::setPaging method for more details.

Swader\Diffbot\Entity\Article::getNextPages()
Returns:array

If the article is a multi-page one, returns the list of absolute URLs of the pages that follow after the one that was processed. If the article is a single-page one, an empty array is returned.

Swader\Diffbot\Entity\Article::getSentiment()
Returns:float | null

Returns the sentiment score of the analyzed article text, a value ranging from -1.0 (very negative) to 1.0 (very positive). If the sentiment score is absent (due to Diffbot being unable to determine it, or due to Swader\Diffbot\Api\Article::setSentiment being set to false), returns null.

Swader\Diffbot\Entity\Article::getDiscussion()
Returns:Swader\Diffbot\Entity\Discussion | null

Returns the Swader\Diffbot\Entity\Discussion found on the article’s page (comments section). See Swader\Diffbot\Api\Article::setDiscussion for details and below for usage:

use Swader\Diffbot\Diffbot;

$url = "http://www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/";

$diffbot = new Diffbot("my_token");
$api = $diffbot->createArticleApi($url);

$result = $api->call();

echo $result->getDiscussion()->getNumPosts(); // 7
echo $result->getDiscussion()->getProvider(); // Disqus

For other methods exposed on the Swader\Diffbot\Entity\Discussion entity, see its documentation.

Swader\Diffbot\Entity\Article::getImages()
Returns:array

An array of images found in the article, with their details. The elements of the array are arrays like this one:

/**

array (size=7)
  'height' => int 512
  'diffbotUri' => string 'image|3|-851701004' (length=18)
  'naturalHeight' => int 727
  'width' => int 749
  'primary' => boolean true
  'naturalWidth' => int 1063
  'url' => string 'http://dab1nmslvvntp.cloudfront.net/wp-content/uploads/2014/07/140624455201.png' (length=79)

**/

Unlike the Swader\Diffbot\Api\Discussion API which returns details about discussion posts even when used with the Swader\Diffbot\Api\Article API, the image data returned with this method is minimal. For fuller details about images, use the Swader\Diffbot\Api\Image API.
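
Because the elements are plain arrays, picking out the primary image is a matter of a short loop. In the sketch below, findPrimaryImage is a hypothetical helper, the sample entries and their URLs are made up, and treating a missing primary key as false is an assumption.

```php
<?php

// Sample data shaped like the array above; entries and URLs are made up.
$images = [
    ['url' => 'http://example.com/thumb.png', 'primary' => false, 'naturalWidth' => 100],
    ['url' => 'http://example.com/hero.png', 'primary' => true, 'naturalWidth' => 1063],
];

function findPrimaryImage(array $images): ?array
{
    foreach ($images as $image) {
        // Assumption: a missing "primary" key means "not primary".
        if (!empty($image['primary'])) {
            return $image;
        }
    }

    return null;
}

$primary = findPrimaryImage($images);
echo $primary['url'], PHP_EOL; // http://example.com/hero.png
```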

Swader\Diffbot\Entity\Article::getVideos()
Returns:array

Essentially identical to the above getImages, but for videos. Arrays in the resulting array will look like this:

/**
[
    "diffbotUri": "video|3|-1138675744",
    "primary": true,
    "url": "http://player.vimeo.com/video/22439234"
]
**/

Product API

The Product API is used to parse pages representing products. These can be anything from eBay auction pages and books on Amazon, to leashes and collars in “mom and pop’s pet web shop”.

The Product API will attempt to recognize some of the most popular product-related fields in any given product page, including but not limited to:

  • price
  • discount
  • availability status
  • stock level
  • characteristics / stats (like smartphone capacity, battery life, network type...)
  • reviews
  • unique identification number like SKU / ISBN / MPN / UPC...
  • and much more...

For a more thorough walk through the product API, see the official docs and demo.

The Product API part of the Diffbot PHP client consists of two main classes: the API class, and the Product Entity class. We’ll describe them in order. Note that the API class extends Swader\Diffbot\Abstracts\Api, so be sure to read that first if you haven’t already.

Product API Class

class Swader\Diffbot\Api\Product

Basic Usage:

use Swader\Diffbot\Diffbot;

$url = 'http://some-product-to-process.com';

$diffbot = new Diffbot('my_token');
$api = $diffbot->createProductApi($url);

Swader\Diffbot\Api\Product::setDiscussion($bool = true)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

Whether or not to use the Discussion API to additionally process any detected comment or review threads on the product page. Behaves as if the Swader\Diffbot\Api\Discussion was set to process the page, and merges the returned data with the Product API’s results by means of a discussion field in the result. The field will have all the sub-fields of the usual Swader\Diffbot\Api\Discussion call; i.e. you will be able to access the Swader\Diffbot\Entity\Discussion entity and all its sub entities via the Swader\Diffbot\Entity\Product::getDiscussion method.

Swader\Diffbot\Api\Product::setColors($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

If set to true, the Product API will try to find out the color options of the product, if available. This feature is experimental and often fails even when color options are obvious.

Swader\Diffbot\Api\Product::setAvailability($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

If set to true, Diffbot will attempt to find out whether or not the product in question is available / in stock.

Swader\Diffbot\Api\Product::setSize($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

If set to true, Diffbot will attempt to find out which sizes the product is offered in. Similar to Swader\Diffbot\Api\Product::setColors, this method is unreliable and highly experimental.

Product Entity Class

When the Product API is done processing a product (or several) the result will be a Product Entity (i.e. a collection containing one Product Entity inside an instance of Swader\Diffbot\Entity\EntityIterator).

For an overview of the abstract class all Entities build on, see Swader\Diffbot\Abstracts\Entity.

Note that the Product entity can also be returned by the Swader\Diffbot\Api\Analyze API in “product” mode, or in default mode when processing a URL that contains a product (auto-determined).

class Swader\Diffbot\Entity\Product

Swader\Diffbot\Entity\Product::__construct(array $data)
Parameters:
  • $data (array) – The data from which to build the Product entity

The Product entity’s constructor needs the data to populate its properties (see getters below). This class is automatically instantiated after a Swader\Diffbot\Api\Product or Swader\Diffbot\Api\Analyze call. You probably won’t ever need to manually create an instance of this class.

In the case of the Product entity, the constructor differs from the abstract one (Swader\Diffbot\Abstracts\Entity::__construct) in that it also looks for the discussion key in the result, in order to build a Swader\Diffbot\Entity\Discussion sub-entity (see Swader\Diffbot\Entity\Product::getDiscussion).

Swader\Diffbot\Entity\Product::getType()
Returns:string

Will always return “product” for products:

// ... API setup ... //
$result = $api->call();

echo $result->getType(); // "product"

Swader\Diffbot\Entity\Product::getText()
Returns:string | null

Returns the plaintext content of the processed product page. HTML tags are stripped completely, images are removed. If the text property is missing in the result, returns null.

Swader\Diffbot\Entity\Product::getRegularPrice()
Returns:string

Returns regular price as string, e.g. “$23.99” or “32 kn”. If not found, returns offerPrice instead - see Swader\Diffbot\Entity\Product::getOfferPrice.

Swader\Diffbot\Entity\Product::getRegularPriceDetails()
Returns:array

Separates regularPrice into components like currency, amount, and full string. If not found, serves as alias for Swader\Diffbot\Entity\Product::getOfferPriceDetails.

Usage:

// ... API setup ... //
$result = $api->call();

var_dump($result->getRegularPriceDetails());

/**

array (size=3)
  'amount' => float 49.85
  'text' => string '£49.85' (length=7)
  'symbol' => string '£' (length=2)

**/
Swader\Diffbot\Entity\Product::getShippingAmount()
Returns:string

Returns shipping price as string, e.g. “$5.99”.

Swader\Diffbot\Entity\Product::getSaveAmount()
Returns:string

Returns difference between regular price and offer price, as string, e.g. “$5.99”.

Swader\Diffbot\Entity\Product::getSaveAmountDetails()
Returns:array

Separates saveAmount into components like currency, amount, and full string, much like Swader\Diffbot\Entity\Product::getRegularPriceDetails. One of the array keys is also a flag indicating whether or not the save amount is a percentage value.

Usage:

// ... API setup ... //
$result = $api->call();

var_dump($result->getSaveAmountDetails());

/**

array (size=4)
  'amount' => float 13.5
  'text' => string '£13.50' (length=7)
  'symbol' => string '£' (length=2)
  'percentage' => boolean false

**/
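
Because the details arrays expose the amount as a float, they lend themselves to simple arithmetic. The following is a minimal sketch, assuming arrays shaped like the dumps above in place of a live API result; savingsPercent is a hypothetical helper and not part of the client.

```php
<?php

/**
 * Hypothetical helper: what share of the regular price the saving represents.
 * Both arguments are arrays shaped like the output of
 * getRegularPriceDetails() and getSaveAmountDetails().
 */
function savingsPercent(array $regular, array $save): ?float
{
    // If the save amount is already flagged as a percentage, return it unchanged.
    if (!empty($save['percentage'])) {
        return (float) $save['amount'];
    }
    // Avoid dividing by zero when no usable regular price is present.
    if (empty($regular['amount'])) {
        return null;
    }
    return round($save['amount'] / $regular['amount'] * 100, 1);
}

// Stand-in data copied from the dumps in this section, not a live response.
$regular = ['amount' => 49.85, 'text' => '£49.85', 'symbol' => '£'];
$save = ['amount' => 13.5, 'text' => '£13.50', 'symbol' => '£', 'percentage' => false];

echo savingsPercent($regular, $save), "%\n";
```

The helper returns null rather than 0 when the regular price is missing, so callers can tell "no data" apart from "no saving".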
Swader\Diffbot\Entity\Product::getProductId()
Returns:string | null

Diffbot-determined unique product ID. If upc, isbn, mpn or sku are identified on the page, productId will select from these values in the above order. Null if none found.
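
The selection order can be illustrated with a short sketch. Diffbot performs this selection on its own side; pickProductId below is a hypothetical helper that only mirrors the order described above, and the identifiers are made up.

```php
<?php

/**
 * Hypothetical helper mirroring the priority order described above:
 * the first non-empty value among upc, isbn, mpn and sku wins.
 */
function pickProductId(array $ids): ?string
{
    foreach (['upc', 'isbn', 'mpn', 'sku'] as $key) {
        if (isset($ids[$key]) && $ids[$key] !== '') {
            return (string) $ids[$key];
        }
    }
    // None of the four identifiers was found.
    return null;
}

// Made-up identifiers for illustration only; mpn outranks sku.
echo pickProductId(['sku' => 'SD-542006', 'mpn' => 'MPN-001']), "\n";
```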

Swader\Diffbot\Entity\Product::getUpc()
Returns:string | null

UPC number, if found.

Swader\Diffbot\Entity\Product::getMpn()
Returns:string | null

MPN number, if found.

Swader\Diffbot\Entity\Product::getIsbn()
Returns:string | null

ISBN number, if found.

Swader\Diffbot\Entity\Product::getSku()
Returns:string | null

Returns Stock Keeping Unit – store/vendor inventory number or identifier if available. If not, returns null.

Swader\Diffbot\Entity\Product::getSpecs()
Returns:array

If a specifications table or similar data is available on the product page, individual specifications will be returned in the specs object as name/value pairs. Names will be normalized to lowercase with spaces replaced by underscores, e.g. display_resolution.

If no specs table is found, an empty array will be returned.
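
To look up a specification by the name shown on the page, the same normalization can be applied locally first. This is a minimal sketch, assuming the normalization is exactly "lowercase, spaces to underscores" as described above; normalizeSpecName is a hypothetical helper and the specs array is invented.

```php
<?php

/**
 * Hypothetical helper: normalize a human-readable spec name the way the
 * text above describes (lowercase, spaces replaced by underscores).
 */
function normalizeSpecName(string $name): string
{
    return str_replace(' ', '_', strtolower(trim($name)));
}

// Stand-in for the array returned by getSpecs(); values are invented.
$specs = ['display_resolution' => '1920 x 1080', 'weight' => '1.2 kg'];

$key = normalizeSpecName('Display Resolution');

// Fall back to a placeholder when the spec is not present.
echo $specs[$key] ?? 'n/a', "\n";
```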

Swader\Diffbot\Entity\Product::getImages()
Returns:array

An array of images found on the product page, with their details. The elements of the array are arrays like this one:

/**

array (size=7)
  'height' => int 512
  'diffbotUri' => string 'image|3|-851701004' (length=18)
  'naturalHeight' => int 727
  'width' => int 749
  'primary' => boolean true
  'naturalWidth' => int 1063
  'url' => string 'http://dab1nmslvvntp.cloudfront.net/wp-content/uploads/2014/07/140624455201.png' (length=79)

**/

Unlike the Swader\Diffbot\Api\Discussion API which returns details about discussion posts even when used with the Swader\Diffbot\Api\Product API, the image data returned with this method is minimal. For fuller details about images, use the Swader\Diffbot\Api\Image API.
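
A common need is picking the main image out of this array. The sketch below assumes elements shaped like the dump above (with primary and url keys); primaryImageUrl is a hypothetical helper and the URLs are placeholders.

```php
<?php

/**
 * Hypothetical helper: return the URL of the image flagged as primary,
 * falling back to the first image, or null when the list is empty.
 */
function primaryImageUrl(array $images): ?string
{
    foreach ($images as $image) {
        if (!empty($image['primary'])) {
            return $image['url'] ?? null;
        }
    }
    // No image is flagged as primary: fall back to the first element, if any.
    $first = reset($images);
    return is_array($first) ? ($first['url'] ?? null) : null;
}

// Placeholder data standing in for the result of getImages().
$images = [
    ['url' => 'http://example.com/thumb.png', 'primary' => false],
    ['url' => 'http://example.com/main.png', 'primary' => true],
];

echo primaryImageUrl($images), "\n";
```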

Swader\Diffbot\Entity\Product::getPrefixCode()
Returns:string | null

Country of origin as identified by UPC/ISBN, e.g. “United Kingdom”. Null if not present.

Swader\Diffbot\Entity\Product::getProductOrigin()
Returns:string | null

If available, two-character ISO country code where the product was produced (e.g. “gb”). Null if not present.

Swader\Diffbot\Entity\Product::getPriceRange()
Returns:array | null

If the product is available in a range of prices, the minimum and maximum values will be returned. The lowest price will also be returned as the offerPrice (see Swader\Diffbot\Entity\Product::getOfferPrice). If no range is detected, returns null.

Swader\Diffbot\Entity\Product::getQuantityPrices()
Returns:array | null

If the product is available with quantity-based discounts, all identifiable price points will be returned. The lowest price will also be returned as the offerPrice (see Swader\Diffbot\Entity\Product::getOfferPrice). If no quantity-based prices are detected, returns null.

Swader\Diffbot\Entity\Product::isAvailable()
Returns:bool | null

Tries to determine whether or not the product is available / in stock. Returns boolean if determined, or null if not.
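
Since null and false are both falsy in PHP, a loose check such as if (!$available) would treat "unknown" as "out of stock". The sketch below shows one way to keep the three states apart; availabilityLabel is a hypothetical helper, not part of the client.

```php
<?php

/**
 * Hypothetical helper: turn the three possible isAvailable() results
 * (true, false, null) into a label, using a strict comparison for null.
 */
function availabilityLabel(?bool $available): string
{
    if ($available === null) {
        return 'unknown';
    }
    return $available ? 'in stock' : 'out of stock';
}

// The argument stands in for the value returned by isAvailable().
echo availabilityLabel(null), "\n";
```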

Swader\Diffbot\Entity\Product::getOfferPrice()
Returns:string

Returns price as string, e.g. “£49.85” or “32 kn”.

Swader\Diffbot\Entity\Product::getOfferPriceDetails()
Returns:array

Separates offerPrice into components like currency, amount, and full string.

Usage:

// ... API setup ... //
$result = $api->call();

var_dump($result->getOfferPriceDetails());

/**

array (size=3)
  'amount' => float 49.85
  'text' => string '£49.85' (length=7)
  'symbol' => string '£' (length=2)

**/
Swader\Diffbot\Entity\Product::getSize()
Returns:array | null

If product is available in different sizes, returns array of those sizes. Highly experimental and often unreliable. This field is optional, and needs to be set on the API. See Swader\Diffbot\Api\Product::setSize.

Swader\Diffbot\Entity\Product::getColors()
Returns:array | null

If the product is available in multiple colors, returns the color options. Highly experimental and often unreliable. This field is optional, and needs to be set on the API. See Swader\Diffbot\Api\Product::setColors.

Swader\Diffbot\Entity\Product::getBrand()
Returns:string

The brand of the product, as determined by Diffbot.

Swader\Diffbot\Entity\Product::getDiscussion()
Returns:Swader\Diffbot\Entity\Discussion | null

Returns the Swader\Diffbot\Entity\Discussion found on the product’s page (review section). See Swader\Diffbot\Api\Product::setDiscussion for details and below for usage:

use Swader\Diffbot\Diffbot;

$url = "http://www.sportsdirect.com/slazenger-plain-polo-shirt-mens-542006?colcode=54200601";

$diffbot = new Diffbot("my_token");
$api = $diffbot->createProductApi($url);

$result = $api->call();

echo $result->getDiscussion()->getNumPosts(); // 10
echo $result->getDiscussion()->getParticipants(); // 10

For other methods exposed on the Swader\Diffbot\Entity\Discussion entity, see its documentation.

Discussion API

This API is used to turn content like product reviews, comments on posts and forum threads into JSON. This API can be unleashed onto a forum / comment thread directly, or onto a product page / article page containing comments / reviews.

The Discussion API part of the Diffbot PHP client consists of three main classes: the API class, the Discussion Entity class, and the Post Entity class. We’ll describe them in order. Note that the API class extends Swader\Diffbot\Abstracts\Api, so be sure to read that first if you haven’t already.

Discussion API Class

class Swader\Diffbot\Api\Discussion

Basic Usage:

use Swader\Diffbot\Diffbot;

$url = 'http://some-article-to-process.com';

$diffbot = new Diffbot('my_token');
$api = $diffbot->createDiscussionApi($url);
Swader\Diffbot\Api\Discussion::setMaxPages($max = 1)
Parameters:
  • $max (int) – max number of pages to fetch
Returns:

$this

Set the maximum number of pages in a thread to automatically concatenate in a single response. Default = 1 (no concatenation). Set maxPages=all to retrieve all pages of a thread regardless of length. Each individual page will count as a separate API call.

Swader\Diffbot\Api\Discussion::setSentiment($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

This method sets the optional sentiment field. When enabled, every analyzed post is returned with a sentiment score, a value ranging from -1.0 (very negative) to 1.0 (very positive). Advanced features like keyword and entity extraction are powered by Semantria, but the basic sentiment analysis (score only) is enabled for everyone, even those without Semantria accounts.

Usage:

$url = 'https://www.reddit.com/r/PHP/comments/3nl7g1/authentication_flow_in_a_microservice_architecture/';

// ...

$api->setSentiment(true);
$result = $api->call();

// ...

echo $result->getPosts()[0]->getSentiment(); // -0.0789

Discussion Entity Class

When the Discussion API is done processing a URL, the result will be a Discussion Entity (i.e. a collection containing one Discussion entity inside an instance of Swader\Diffbot\Entity\EntityIterator).

For an overview of the abstract class all Entities build on, see Swader\Diffbot\Abstracts\Entity.

class Swader\Diffbot\Entity\Discussion
Swader\Diffbot\Entity\Discussion::__construct(array $data)
Parameters:
  • $data (array) – The data from which to build the Discussion object

The Discussion entity’s constructor needs the data to populate its properties (see getters below). This class is automatically instantiated after a Swader\Diffbot\Api\Discussion call. You probably won’t ever need to manually create an instance of this class.

Like Swader\Diffbot\Entity\Product and Swader\Diffbot\Entity\Article, the Discussion entity also has its own custom constructor, looking for the posts key inside of the return data, in order to create some nested Swader\Diffbot\Entity\Post objects.

Swader\Diffbot\Entity\Discussion::getType()
Returns:string

Will always return “discussion” for discussions:

// ... API setup ... //
$result = $api->call();

echo $result->getType(); // "discussion"
Swader\Diffbot\Entity\Discussion::getNumPosts()
Returns:int

Returns the number of posts found in the discussion. Only returns the number of posts in the fetched page range, so even if there are 100 posts over 20 pages, this method will return 5 if Swader\Diffbot\Api\Discussion::setMaxPages is still set to 1.

Swader\Diffbot\Entity\Discussion::getTags()
Returns:array

Returns an array of tags/entities, generated from analysis of the extracted text and cross-referenced with DBpedia and other data sources. Note that these are not the meta tags as defined in the page’s <head>, but machine learned ones:

// ... API setup ... //

$url = 'https://www.reddit.com/r/PHP/comments/3nl7g1/authentication_flow_in_a_microservice_architecture/';

// ...

$result = $api->call();

echo count($result->tags); // 5

var_dump($result->getTags());

/**

Output:
    array (size=5)
      0 =>
        array (size=5)
          'count' => int 5
          'prevalence' => float 0.11
          'score' => float 0.11
          'label' => string 'User (computing)' (length=16)
          'uri' => string 'http://dbpedia.org/resource/User_(computing)' (length=44)
      1 =>
        array (size=5)
          'count' => int 4
          'prevalence' => float 0.09
          'score' => float 0.09
          'label' => string 'Hypertext Transfer Protocol' (length=27)
          'uri' => string 'http://dbpedia.org/resource/Hypertext_Transfer_Protocol' (length=55)
      2 =>
        array (size=5)
          'count' => int 3
          'prevalence' => float 0.07
          'score' => float 0.07
          'label' => string 'POST (HTTP)' (length=11)
          'uri' => string 'http://dbpedia.org/resource/POST_(HTTP)' (length=39)
      3 =>
        array (size=5)
          'count' => int 2
          'prevalence' => float 0.04
          'score' => float 0.04
          'label' => string 'Object (computer science)' (length=25)
          'uri' => string 'http://dbpedia.org/resource/Object_(computer_science)' (length=53)
      4 =>
        array (size=5)
          'count' => int 2
          'prevalence' => float 0.04
          'score' => float 0.04
          'label' => string 'Coupling' (length=8)
          'uri' => string 'http://dbpedia.org/resource/Coupling' (length=36)
**/

Returns a maximum of 5.
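
Each tag carries a score, so the list is easy to filter down to the most relevant labels. The following is a minimal sketch, assuming an array shaped like the dump above (shortened here to the label and score keys); labelsAbove is a hypothetical helper.

```php
<?php

/**
 * Hypothetical helper: keep only the labels of tags whose score reaches
 * a threshold. $tags is shaped like the getTags() dump above.
 */
function labelsAbove(array $tags, float $minScore): array
{
    $kept = array_filter($tags, function (array $tag) use ($minScore) {
        return $tag['score'] >= $minScore;
    });
    // array_column() extracts the labels; array_values() reindexes from 0.
    return array_values(array_column($kept, 'label'));
}

// Shortened copy of the dump above, not a live response.
$tags = [
    ['label' => 'User (computing)', 'score' => 0.11],
    ['label' => 'Hypertext Transfer Protocol', 'score' => 0.09],
    ['label' => 'POST (HTTP)', 'score' => 0.07],
    ['label' => 'Coupling', 'score' => 0.04],
];

print_r(labelsAbove($tags, 0.09));
```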

Swader\Diffbot\Entity\Discussion::getParticipants()
Returns:int

The number of unique participants in the discussion.

Swader\Diffbot\Entity\Discussion::getNumPages()
Returns:int

Returns the number of pages if the discussion is a multi-page one. Read about auto-concatenation here and study the Swader\Diffbot\Api\Discussion::setMaxPages method for more details.

Swader\Diffbot\Entity\Discussion::getNextPages()
Returns:array

If the discussion is a multi-page one, returns the list of absolute URLs of the pages that follow after the one that was processed. If the discussion is a single-page one, an empty array is returned.

Swader\Diffbot\Entity\Discussion::getNextPage()
Returns:string | null

If the discussion is a multi-page one, returns the absolute subsequent page URL.

Swader\Diffbot\Entity\Discussion::getProvider()
Returns:string | null

Returns the provider of the comment / review system. This will be something like “disqus”, “facebook”, etc. In cases of forums and similar all-encompassing systems like Reddit, this method will return null.

Swader\Diffbot\Entity\Discussion::getRssUrl()
Returns:string | null

Returns the RSS feed URL for the discussion, if available.

Swader\Diffbot\Entity\Discussion::getConfidence()
Returns:float | null

A number from -1 to 1. The meaning of this value is not documented yet (pending feedback from Diffbot).

Swader\Diffbot\Entity\Discussion::getPosts()
Returns:array

Returns an array of Swader\Diffbot\Entity\Post objects, each built around the data in every individual post of a discussion. For post accessor methods, see below.

Discussion Post Class

class Swader\Diffbot\Entity\Post

Every Discussion entity has children - its posts. Every Post is its own entity, and very similar to Swader\Diffbot\Entity\Article, sharing many of its methods.

Swader\Diffbot\Entity\Post::getType()
Returns:string

Will always return “post” for posts.

Swader\Diffbot\Entity\Post::getLang()
Returns:string

Returns the language code of the detected language of the processed content. The code returned is a two-character ISO 639-1 code: http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

Swader\Diffbot\Entity\Post::getHumanLanguage()
Returns:string

Alias method for getLang() above.

Swader\Diffbot\Entity\Post::getText()
Returns:string | null

Returns the plaintext content of the processed post. HTML tags are stripped completely, images are removed. If the text property is missing in the result, returns null.

Swader\Diffbot\Entity\Post::getHtml()
Returns:string | null

Returns the full HTML content of the post. If the HTML property is missing in the result, returns null.

Swader\Diffbot\Entity\Post::getDate()
Returns:string

Returns date as per RFC 2616. Example date: “Wed, 18 Dec 2013 00:00:00 GMT”. Note that this is strtotime friendly for further conversions.
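
As a quick illustration of the strtotime remark, the sketch below converts the example date from the text above (not a value fetched from the API) into a timestamp and reformats it with PHP's built-in date functions.

```php
<?php

// Example date string taken from the text above, not from a live call.
$date = 'Wed, 18 Dec 2013 00:00:00 GMT';

// strtotime() returns a Unix timestamp, or false if parsing fails.
$timestamp = strtotime($date);

if ($timestamp !== false) {
    // gmdate() formats in UTC regardless of the server's timezone.
    echo gmdate('Y-m-d', $timestamp), "\n";
}
```

Checking for false before formatting matters, because a failed parse would otherwise be formatted as the Unix epoch.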

Swader\Diffbot\Entity\Post::getAuthor()
Returns:string | null

Returns the name of the author as written on the page. If Diffbot was unable to figure out who the author is, null is returned.

Swader\Diffbot\Entity\Post::getAuthorUrl()
Returns:string | null

If the author’s profile URL could be determined, this method will return it.

Swader\Diffbot\Entity\Post::getTags()
Returns:array

Returns an array of tags/entities, generated from analysis of the extracted text and cross-referenced with DBpedia and other data sources. Note that these are not the meta tags as defined by the author, but machine learned ones. Same thing as Swader\Diffbot\Entity\Article::getTags and Swader\Diffbot\Entity\Discussion::getTags.

Swader\Diffbot\Entity\Post::getSentiment()
Returns:float | null

Returns the sentiment score of the analyzed post text, a value ranging from -1.0 (very negative) to 1.0 (very positive). If the sentiment score is absent (due to Diffbot being unable to determine it, or due to Swader\Diffbot\Api\Discussion::setSentiment being set to false), returns null.
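
Because individual posts may come back without a score, any aggregate should skip null values rather than count them as 0. The sketch below assumes a plain array of scores collected from each post; averageSentiment is a hypothetical helper and the scores are invented.

```php
<?php

/**
 * Hypothetical helper: average a list of sentiment scores, skipping
 * null entries (posts for which no score was returned).
 */
function averageSentiment(array $scores): ?float
{
    // Drop posts without a score; 0.0 is a valid score and is kept.
    $known = array_filter($scores, function ($score) {
        return $score !== null;
    });
    if (count($known) === 0) {
        return null;
    }
    return array_sum($known) / count($known);
}

// Invented scores; the null stands for a post with no sentiment value.
$average = averageSentiment([0.5, null, -0.1]);

echo $average === null ? 'n/a' : round($average, 2), "\n";
```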

Swader\Diffbot\Entity\Post::getVotes()
Returns:int

If a voting system exists and is easily discernible, Diffbot returns the number of upvotes on the post.

Swader\Diffbot\Entity\Post::getId()
Returns:int

Returns the ID of the post (usually the ordinal number of the post in the list of all posts, starting with 0 for the first one).

Swader\Diffbot\Entity\Post::getParentId()
Returns:int | null

If the post is a reply, this is the ID of the post it replies to. If not, null.
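
Together, the post ID and parent ID are enough to rebuild the reply structure of a thread. The following is a minimal sketch, assuming plain arrays that stand in for values read via getId() and getParentId(); groupReplies and the 'root' key are illustrative choices, not part of the client.

```php
<?php

/**
 * Hypothetical helper: group post IDs by the ID of the post they reply to.
 * Top-level posts (parent ID null) are collected under the key 'root'.
 */
function groupReplies(array $posts): array
{
    $children = [];
    foreach ($posts as $post) {
        // ?? only falls back on null, so a parent ID of 0 is preserved.
        $parent = $post['parentId'] ?? 'root';
        $children[$parent][] = $post['id'];
    }
    return $children;
}

// Invented thread: post 0 is top-level, 1 and 2 reply to 0, 3 replies to 1.
$posts = [
    ['id' => 0, 'parentId' => null],
    ['id' => 1, 'parentId' => 0],
    ['id' => 2, 'parentId' => 0],
    ['id' => 3, 'parentId' => 1],
];

print_r(groupReplies($posts));
```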

Swader\Diffbot\Entity\Post::getImages()
Returns:array

An array of images found in the post, with their details. The elements of the array are arrays like this one:

/**

array (size=7)
  'height' => int 512
  'diffbotUri' => string 'image|3|-851701004' (length=18)
  'naturalHeight' => int 727
  'width' => int 749
  'primary' => boolean true
  'naturalWidth' => int 1063
  'url' => string 'http://dab1nmslvvntp.cloudfront.net/wp-content/uploads/2014/07/140624455201.png' (length=79)

**/

The image data returned with this method is minimal. For fuller details about images, use the Swader\Diffbot\Api\Image API.

Swader\Diffbot\Entity\Post::getPageUrl()
Returns:string

Returns the URL which was processed (the thread URL in most cases).

Image API

This API is used to turn content like image galleries, Instagram posts, or image-rich articles into JSON.

For examples of data that might be returned, please see http://diffbot.com and run the Image API demo.

The Image API part of the Diffbot PHP client consists of two main classes: the API class, and the Image Entity class. We’ll describe them in order. Note that the API class extends Swader\Diffbot\Abstracts\Api, so be sure to read that first if you haven’t already.

Image API Class

class Swader\Diffbot\Api\Image

Basic Usage:

use Swader\Diffbot\Diffbot;

$url = 'http://some-article-to-process.com';

$diffbot = new Diffbot('my_token');
$api = $diffbot->createImageApi($url);
Swader\Diffbot\Api\Image::setMentions($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

If set to true, the Image API will attempt to identify other locations online where the image was used - similar to Google Image reverse search.

Swader\Diffbot\Api\Image::setFaces($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

Highly experimental. Finds the x and y coordinates, height and width of human faces in the image.

Swader\Diffbot\Api\Image::setOcr($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

Highly experimental. If set, attempts to recognize and read text in the images.

Image Entity Class

When the Image API is done processing a URL, the result will be an instance of Swader\Diffbot\Entity\EntityIterator containing one Image Entity instance for every image found.

For an overview of the abstract class all Entities (including Image) build on, see Swader\Diffbot\Abstracts\Entity.

Note that the Image entities can also be returned by the Swader\Diffbot\Api\Analyze API in “image” mode, or in default mode when processing a URL that is essentially an image.

class Swader\Diffbot\Entity\Image
Swader\Diffbot\Entity\Image::getType()
Returns:string

Will always return “image” for images:

// ... API setup ... //
$result = $api->call();

echo $result->getType(); // "image"
Swader\Diffbot\Entity\Image::getHeight()
Returns:int

Height of image if resized by browser via CSS / JS. If not resized, serves as alias for Swader\Diffbot\Entity\Image::getNaturalHeight.

Swader\Diffbot\Entity\Image::getWidth()
Returns:int

Width of image if resized by browser via CSS / JS. If not resized, serves as alias for Swader\Diffbot\Entity\Image::getNaturalWidth.

Swader\Diffbot\Entity\Image::getNaturalHeight()
Returns:int

Raw image height, in pixels.

Swader\Diffbot\Entity\Image::getNaturalWidth()
Returns:int

Raw image width, in pixels.

Swader\Diffbot\Entity\Image::getUrl()
Returns:string

URL of the image.

Swader\Diffbot\Entity\Image::getAnchorUrl()
Returns:string | null

URL the image links to, if any. Null if image isn’t linked.

Swader\Diffbot\Entity\Image::getXPath()
Returns:string

The XPath expression of the position of the image node in the DOM.

Swader\Diffbot\Entity\Image::getMentions()
Returns:array

Returns an array of [title => “title”, link => “link”] arrays for all posts where this image, or a similar one, was found. If not found, returns empty array.

Swader\Diffbot\Entity\Image::getFaces()
Returns:array | string

Finds human faces in the image and returns an array of arrays with the keys x, y, height and width. In most cases, does not work at all and is in heavy alpha mode. Do not rely on this method for anything. Returns empty string if nothing found.

Swader\Diffbot\Entity\Image::getOcr()
Returns:string

The text recognized in the picture. In most cases, does not work at all and is in heavy alpha mode. Do not rely on this method for anything. Returns empty string if nothing found.

Analyze API

This API is a sort of “catch all” for all other API types in that it automatically determines the type of content being processed, and applies the appropriate API call to it.

This API will return entities matching the determined content type. For example, if you run Analyze API on a URL like www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/, the content type will be determined as “article” and it’ll be exactly as if you had called the Article API (Swader\Diffbot\Api\Article) on it.

Analyze API Class

class Swader\Diffbot\Api\Analyze
Swader\Diffbot\Api\Analyze::setDiscussion($bool = true)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

If set to false, will not extract article comments in a Discussion entity embedded in the Article / Product entity. By default, it will.

Swader\Diffbot\Api\Analyze::setMode($mode)
Parameters:
  • $mode (string) – “article”, “product”, “image” or “auto”
Returns:

$this

By default the Analyze API will fully extract all pages that match an existing Automatic API – articles, products or image pages. Set mode to a specific page-type (e.g., mode=article) to extract content only from that specific page-type. All other pages will simply return the default Analyze fields.

Usage with defaults:

use Swader\Diffbot\Diffbot;

$url = "www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/";

$diffbot = new Diffbot("my_token");
$api = $diffbot->createAnalyzeApi($url);

$result = $api->call();

echo $result->getAuthorUrl(); // "http://www.sitepoint.com/author/bskvorc/"

echo $result->getDiscussion()->getNumPosts(); // 7
echo $result->getDiscussion()->getProvider(); // Disqus

Usage with discussion off:

use Swader\Diffbot\Diffbot;

$url = "www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/";

$diffbot = new Diffbot("my_token");
$api = $diffbot->createAnalyzeApi($url);

$api->setDiscussion(false);
$result = $api->call();

echo $result->getAuthorUrl(); // "http://www.sitepoint.com/author/bskvorc/"

var_dump($result->getDiscussion()); // null

Usage with non-matching mode:

use Swader\Diffbot\Diffbot;

$url = "www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/";

$diffbot = new Diffbot("my_token");
$api = $diffbot->createAnalyzeApi($url);

$api->setMode("image");
$result = $api->call();

echo $result->getAuthorUrl(); // null
var_dump($result->getDiscussion()); // null

In the last example above, no data is available due to a mismatch in mode - using image parsing on an article entity does not produce any useful information.

Custom API

The Custom API is user defined in the Diffbot UI.

For a tutorial on creating a Custom API in the Diffbot UI, see here.

Custom API Class

class Swader\Diffbot\Api\Custom

When you have a Custom API ready on Diffbot’s end, you instantiate the Custom API class and pass in the Custom API name, along with the URL to process. Everything from that point on is identical to the other APIs, except the fact that instead of specific entities being returned, all Custom API calls return an iterator of Swader\Diffbot\Entity\Wildcard entities.

Swader\Diffbot\Api\Custom::__construct($url, $name)
Parameters:
  • $url (string) – The URL to process
  • $name (string) – The name of the API

The construct method is identical to the one in Swader\Diffbot\Abstracts\Api with one difference - it also needs the name of the Custom API in question, so that it can build the API URL to which the call will be dispatched when Swader\Diffbot\Abstracts\Api::call is called:

<?php

require_once '../vendor/autoload.php';

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot($my_token);

$url = 'http://sitepoint.com/author/bskvorc';
$api = $diffbot->createCustomApi($url, "AuthorFolio");

$result = $api->call();

echo $result->getBio(); // "Bruno is a coder from Croatia..."

In the example above, AuthorFolio is a custom API from this tutorial, which processes a SitePoint author’s portfolio. The getBio call works because of the magic methods in Swader\Diffbot\Abstracts\Entity which Swader\Diffbot\Entity\Wildcard inherits.

Wildcard Entity Class

class Swader\Diffbot\Entity\Wildcard

The Wildcard entity is returned when the type of a processed post does not match a type defined in the currently set EntityFactory (see Swader\Diffbot\Factory\Entity and Swader\Diffbot\Diffbot::setEntityFactory).

It is nothing more than a concretization of Swader\Diffbot\Abstracts\Entity and as such contains no additional methods.

In the example above, the getBio method is called on a Wildcard instance, returned by the call to the AuthorFolio custom API.

Crawl API

Diffbot has the ability to crawl entire domains and process all crawled pages. For a difference between crawling and processing see here.

To programmatically create or update crawljobs, use this API.

A full tutorial on using this API can be found here, and a working app powered by it at http://search.sitepoint.tools.

The Crawl API is also known as the Crawlbot.

Crawl API Class

class Swader\Diffbot\Api\Crawl

The Crawl API is used to create new crawljobs or modify existing ones. The Crawl API is atypical, and as such does not extend Swader\Diffbot\Abstracts\Api unlike the more entity-specific APIs.

Note that everything you can do with the Crawl API can also be done in the Diffbot UI.

Swader\Diffbot\Api\Crawl::__construct($name = null, $api = null)
Parameters:
  • $name (string) – [Optional] The name of crawljob to be created or modified.
  • $api (Swader\Diffbot\Interfaces\Api) – [Optional] The API to use while processing the crawled links.

The $name argument is optional. If omitted, the second argument is ignored and the Swader\Diffbot\Api\Crawl::call will return a list of all crawljobs on a given Diffbot token, with their information, in a Swader\Diffbot\Entity\EntityIterator collection of Swader\Diffbot\Entity\JobCrawl instances.

The $api argument is also optional, but must be an instance of Swader\Diffbot\Interfaces\Api if provided:

<?php

// ... set up Diffbot

$api = $diffbot->createArticleApi('crawl');
$crawljob = $diffbot->crawl('myCrawlJob', $api);

// ... crawljob setup
// $crawljob->setSeeds( ... )

$crawljob->call();
Swader\Diffbot\Api\Crawl::getName()
Returns:string

Returns the unique name of the crawljob. This name is later used to download datasets, or to modify the job.

Swader\Diffbot\Api\Crawl::setApi($api)
Parameters:
  • $api (Swader\Diffbot\Interfaces\Api) – The API to use while processing the crawled links
Returns:

$this

The API cannot be modified after a crawljob has been created, so this method has no effect on existing crawljobs (see https://www.diffbot.com/dev/docs/crawl/api.jsp).

The $api passed into this class will be used on Diffbot’s end to process all the pages the crawljob provides. For example, if you set http://sitepoint.com as the seed URL (see Swader\Diffbot\Api\Crawl::setSeeds), and an instance of the Swader\Diffbot\Api\Article API as the $api argument, all pages found on http://sitepoint.com will be processed with the Article API. The results won’t be returned - rather, they’ll be saved on Diffbot’s servers for searching later (see Swader\Diffbot\Api\Search).

The other APIs require a URL parameter in their constructor, but when crawling, it is Crawlbot that provides the URLs. To get around this requirement, use the string “crawl” instead of a URL when instantiating a new API for use with the Crawl API:

// ...
$api = $diffbot->createArticleApi('crawl');
// ...
Swader\Diffbot\Api\Crawl::setSeeds(array $seeds)
Parameters:
  • $seeds (array) – An array of URLs (seeds) which to crawl for matching links
Returns:

$this

By default Crawlbot will restrict spidering to the entire domain (“http://blog.diffbot.com” will include URLs at “http://www.diffbot.com”):

// ...
$crawljob->setSeeds(['http://sitepoint.com', 'http://blog.diffbot.com']);
// ...
Swader\Diffbot\Api\Crawl::setUrlCrawlPatterns(array $pattern = null)
Parameters:
  • $pattern (array) – [Optional] Array of strings to limit pages crawled to those whose URLs contain any of the content strings.
Returns:

$this

You can use the exclamation point to specify a negative string, e.g. !product to exclude URLs containing the string “product,” and the ^ and $ characters to limit matches to the beginning or end of the URL.

The use of a urlCrawlPattern will allow Crawlbot to spider outside of the seed domain(s); it will follow all matching URLs regardless of domain:

// ...
$crawljob->setUrlCrawlPatterns(['!author', '!page']);
// ...
Swader\Diffbot\Api\Crawl::setUrlCrawlRegex($regex)
Parameters:
  • $regex (string) – a regular expression string
Returns:

$this

Specify a regular expression to limit pages crawled to those URLs that match your expression. This will override any urlCrawlPattern value.

The use of a urlCrawlRegEx will allow Crawlbot to spider outside of the seed domain; it will follow all matching URLs regardless of domain.

Swader\Diffbot\Api\Crawl::setUrlProcessPatterns(array $pattern = null)
Parameters:
  • $pattern (array) – [Optional] array of strings to search for in URLs
Returns:

$this

Only URLs containing one or more of the strings specified will be processed by Diffbot. You can use the exclamation point to specify a negative string, e.g. !/category to exclude URLs containing the string “/category,” and the ^ and $ characters to limit matches to the beginning or end of the URL.
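
The single-pattern rules can be tried out locally before starting a crawljob. The sketch below is an assumption-based approximation of the rules as described above, not Crawlbot's actual matcher, and it does not model how several patterns are combined; urlMatchesPattern is a hypothetical helper.

```php
<?php

/**
 * Local approximation of the single-pattern rules described above:
 *   "!foo" matches URLs that do NOT contain "foo"
 *   "^foo" matches URLs that begin with "foo"
 *   "foo$" matches URLs that end with "foo"
 *   "foo"  matches URLs that contain "foo"
 */
function urlMatchesPattern(string $url, string $pattern): bool
{
    if ($pattern === '') {
        return false;
    }
    if ($pattern[0] === '!' || $pattern[0] === '^') {
        $needle = (string) substr($pattern, 1);
        if ($needle === '') {
            return false; // a bare "!" or "^" carries no string to test
        }
        $position = strpos($url, $needle);
        return $pattern[0] === '!' ? $position === false : $position === 0;
    }
    if (substr($pattern, -1) === '$') {
        $needle = (string) substr($pattern, 0, -1);
        return $needle !== '' && substr($url, -strlen($needle)) === $needle;
    }
    return strpos($url, $pattern) !== false;
}

// The negative pattern rejects a URL that contains "/category".
var_dump(urlMatchesPattern('http://sitepoint.com/category/php', '!/category'));
```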

Swader\Diffbot\Api\Crawl::setUrlProcessRegex($regex)
Parameters:
  • $regex (string) – Regular expression string
Returns:

$this

Specify a regular expression to limit pages processed to those URLs that match your expression. This will override any urlProcessPattern value.

Swader\Diffbot\Api\Crawl::setPageProcessPatterns(array $pattern = null)
Parameters:
  • $pattern (array) – [Optional] Array of strings
Returns:

$this

Specify strings to look for in the HTML of the pages of the crawled URLs. Only pages containing one or more of those strings will be processed by the designated API. Very useful for limiting processing to pages with a certain class present (e.g. class=article) to further narrow down processing scope and reduce expenses (fewer API calls).

Swader\Diffbot\Api\Crawl::setMaxHops($input = -1)
Parameters:
  • $input (int) – [Optional] Maximum number of hops
Returns:

$this

Specify the depth of your crawl. A maxHops=0 will limit processing to the seed URL(s) only – no other links will be processed; maxHops=1 will process all (otherwise matching) pages whose links appear on seed URL(s); maxHops=2 will process pages whose links appear on those pages; and so on. By default, Crawlbot will crawl and process links at any depth.
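
The effect of the hop limit can be pictured with a toy breadth-first traversal over an in-memory link graph. This is only a model of the description above, not Crawlbot's implementation; pagesWithinHops and the page names are invented for illustration.

```php
<?php

/**
 * Toy breadth-first traversal illustrating how a hop limit bounds which
 * pages are reached. $links maps each page to the pages it links to;
 * a negative $maxHops means "no limit".
 */
function pagesWithinHops(array $links, array $seeds, int $maxHops): array
{
    $depthOf = array_fill_keys($seeds, 0); // seeds sit at depth 0
    $queue = $seeds;
    while ($queue) {
        $page = array_shift($queue);
        if ($maxHops >= 0 && $depthOf[$page] >= $maxHops) {
            continue; // do not follow links beyond the hop limit
        }
        foreach ($links[$page] ?? [] as $next) {
            if (!isset($depthOf[$next])) {
                $depthOf[$next] = $depthOf[$page] + 1;
                $queue[] = $next;
            }
        }
    }
    return array_keys($depthOf);
}

// Invented site structure: home links to a and b; a links to c.
$links = ['home' => ['a', 'b'], 'a' => ['c']];

print_r(pagesWithinHops($links, ['home'], 1));
```

With a limit of 0 only the seed is reached, with 1 the pages linked from the seed are added, and a negative limit reaches everything, which matches the default behaviour described above.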

Swader\Diffbot\Api\Crawl::setMaxToCrawl($input = 100000)
Parameters:
  • $input (int) – [Optional] Maximum number of URLs to spider
Returns:

$this

Note that spidering (crawling) does not affect the API quota, so lowering this value only shortens the crawljob (it finishes sooner if the limit is reached sooner). For a difference between crawling and processing see here.

Swader\Diffbot\Api\Crawl::setMaxToProcess($input = 100000)
Parameters:
  • $input (int) – [Optional] Maximum number of URLs to process
Returns:

$this

Useful for limiting the number of API calls made, thus reducing / limiting expenses. For a difference between crawling and processing see here.

Swader\Diffbot\Api\Crawl::notify($string)
Parameters:
  • $string (string) – Email or URL
Returns:

$this

Throws:

InvalidArgumentException if the input parameter is neither a valid email address nor a valid URL

If the input is an email address, a message is sent to this address when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes.

If the input is a URL, you will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the full JSON response in the POST body.

This method can be called once with an email and another time with a URL in order to define both an email notification hook and a URL notification hook. An InvalidArgumentException will be thrown if the argument isn’t a valid string (neither a URL nor an email address).
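
The distinction between the two kinds of input can be sketched with PHP's built-in filters. This is an assumption about the kind of check involved, not the client's actual validation code, and classifyNotifyTarget is a hypothetical name.

```php
// Hypothetical classifier: decides whether a notify() argument looks like an
// email address or a URL. An assumption about the kind of check involved,
// not the client's actual validation code.
function classifyNotifyTarget($target)
{
    if (is_string($target) && filter_var($target, FILTER_VALIDATE_EMAIL) !== false) {
        return 'email';
    }
    if (is_string($target) && filter_var($target, FILTER_VALIDATE_URL) !== false) {
        return 'url';
    }
    throw new InvalidArgumentException('Expected an email address or a URL.');
}

echo classifyNotifyTarget('me@example.com'), "\n";          // email
echo classifyNotifyTarget('http://example.com/hook'), "\n"; // url
```

On a real Crawl instance, both hooks would be registered by calling notify() twice, once with each kind of value.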

Swader\Diffbot\Api\Crawl::setCrawlDelay($input = 0.25)
Parameters:
  • $input (float) – [Optional] delay between successive URL requests to a single IP address, in floating point seconds. Defaults to 0.25 seconds.
Returns:

$this

Throws:

InvalidArgumentException if the input parameter is not a number

Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number.

Swader\Diffbot\Api\Crawl::setRepeat($input)
Parameters:
  • $input (float) – The wait period between crawljob restarts, expressed in floating point days. E.g. 0.5 is 12 hours, 7 is a week, 14.5 is 2 weeks and 12 hours, etc. By default, crawls will not be repeated.
Returns:

$this

Throws:

InvalidArgumentException if the input parameter is not a number
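
Because the repeat period is expressed in days, intervals given in hours need converting first. A minimal sketch; hoursToRepeatDays is a hypothetical helper, not part of the client.

```php
// Hypothetical helper: converts hours into the fractional days that
// setRepeat() expects.
function hoursToRepeatDays($hours)
{
    if (!is_numeric($hours) || $hours <= 0) {
        throw new InvalidArgumentException('Hours must be a positive number.');
    }
    return $hours / 24;
}

var_dump(hoursToRepeatDays(12));  // float(0.5)
var_dump(hoursToRepeatDays(348)); // float(14.5), i.e. 2 weeks and 12 hours
```

The result can be passed straight to setRepeat().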

Swader\Diffbot\Api\Crawl::setOnlyProcessIfNew($int = 1)
Parameters:
  • $int (int) – [Optional] a boolean flag represented as an integer
Returns:

$this

By default repeat crawls will only process new (previously unprocessed) pages. Set to 0 to process all content on repeat crawls.

Swader\Diffbot\Api\Crawl::setMaxRounds($input = 0)
Parameters:
  • $input (int) – [Optional] Maximum number of crawl repeats
Returns:

$this

Specify the maximum number of crawl repeats. By default (maxRounds=0) repeating crawls will continue indefinitely.

Swader\Diffbot\Api\Crawl::setObeyRobots($bool = true)
Parameters:
  • $bool (bool) – [Optional] Either true or false
Returns:

$this

Ignores robots.txt if set to false

Swader\Diffbot\Api\Crawl::roundStart($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns:

$this | Swader\Diffbot\Entity\EntityIterator

Force the start of a new crawl “round” (manually repeat the crawl). If onlyProcessIfNew is set to 1 (default), only newly-created pages will be processed. If $commit is truthy, the call is executed immediately and its result is returned; otherwise the current instance of the API class is returned.

Swader\Diffbot\Api\Crawl::pause($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns:

$this | Swader\Diffbot\Entity\EntityIterator

Pause a crawljob. If $commit is truthy, the call is executed immediately and its result is returned; otherwise the current instance of the API class is returned.

Swader\Diffbot\Api\Crawl::unpause($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns:

$this | Swader\Diffbot\Entity\EntityIterator

Unpause a crawljob. If $commit is truthy, the call is executed immediately and its result is returned; otherwise the current instance of the API class is returned.

Swader\Diffbot\Api\Crawl::restart($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns:

$this | Swader\Diffbot\Entity\EntityIterator

Restart a crawljob. If $commit is truthy, the call is executed immediately and its result is returned; otherwise the current instance of the API class is returned.

Swader\Diffbot\Api\Crawl::delete($commit = true)
Parameters:
  • $commit (bool) – [Optional] Either true or false
Returns:

$this | Swader\Diffbot\Entity\EntityIterator

Delete a crawljob. If $commit is truthy, the call is executed immediately and its result is returned; otherwise the current instance of the API class is returned.

Swader\Diffbot\Api\Crawl::buildUrl()
Returns:string

This method is called automatically when Swader\Diffbot\Abstracts\Api::call is called. It builds the URL which is to be called by the HTTPClient in Swader\Diffbot\Diffbot::setHttpClient, and returns it. This method can be used to get the URL for the purposes of testing in third party API clients like Postman.

Usage:

// ... set up the API instance in $api ...
$myUrl = $api->buildUrl();

Swader\Diffbot\Api\Crawl::call()
Returns:Swader\Diffbot\Entity\EntityIterator

When the API instance has been fully configured, this method executes the call. If all goes well, it returns a collection of Swader\Diffbot\Entity\JobCrawl objects, each with information about a job under the current Diffbot token. How many are returned depends on the action that was performed - see below.

JobCrawl Class

The JobCrawl class is a container of information about a crawljob. If a crawljob is being created with the Crawl API, the Crawl API will return a single instance of JobCrawl with the information about the created job. If the Crawl API is called without settings, it returns all the token’s crawljobs - each in a separate instance. If the crawljob is being deleted, restarted, paused, etc., only the instance pertaining to the relevant crawljob is returned.

class Swader\Diffbot\Entity\JobCrawl
Swader\Diffbot\Entity\JobCrawl::getMaxToCrawl()
Returns:int

Maximum number of pages to crawl with this crawljob

Swader\Diffbot\Entity\JobCrawl::getMaxToProcess()
Returns:int

Maximum number of pages to process with this crawljob

Swader\Diffbot\Entity\JobCrawl::getOnlyProcessIfNew()
Returns:bool

Whether or not the job was set to only process newly found links, ignoring old but potentially updated ones

Swader\Diffbot\Entity\JobCrawl::getSeeds()
Returns:array

Seeds as given to the crawljob on creation. Returned as an array, suitable for direct insertion into a new crawljob via Swader\Diffbot\Api\Crawl::setSeeds
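
For instance, the seeds of an existing job could be carried over to a new crawl as sketched below. The helper name copySeeds is hypothetical; $job is assumed to be a JobCrawl instance and $newCrawl a Crawl API instance created elsewhere.

```php
// Hypothetical helper: copies the seeds of an existing job into a new
// Crawl API instance. Returns whatever setSeeds() returns, which is
// assumed (not confirmed here) to be the Crawl instance itself.
function copySeeds($job, $newCrawl)
{
    return $newCrawl->setSeeds($job->getSeeds());
}
```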

Search API

Diffbot’s Search API allows you to search the extracted content of one or all of your Diffbot “collections.” A collection is a discrete Crawlbot (Swader\Diffbot\Api\Crawl) or Bulk API job, and includes all of the web pages processed within that job.

In order to search a collection, you must first create that collection using either Crawlbot or the Bulk API. A collection can be searched before a crawl or bulk job is finished.

Whereas Crawlbot returns information about a specific crawljob, the Search API returns sets of matching documents from Diffbot’s database, depending on provided query parameters.

The API consists of two parts: the API class used to make the call and return the results, and the SearchInfo class as an alternative result, providing metadata about the query and the complete resultset. We’ll describe both, in order.

Note that the API class extends Swader\Diffbot\Abstracts\Api, so be sure to read that first if you haven’t already.

Search API Class

class Swader\Diffbot\Api\Search

This API class is a bit specific in that it only extends Swader\Diffbot\Abstracts\Api to inherit part of a single function - almost everything else is custom implemented, due to the highly specific nature of the API.

Basic usage:

use Swader\Diffbot\Diffbot;

$diffbot = new Diffbot('my_token');
$search = $diffbot->search('author:"Miles Johnson" AND type:article');
$result = $search->call();

foreach ($result as $article) {
    echo $article->getTitle();
}

$info = $search->call(true);

echo $info->getHits(); // 50

Swader\Diffbot\Api\Search::__construct($q)
Parameters:
  • $q (string) – Query string to run on the collection(s)

The constructor takes a string like “foo AND bar AND title:baz”. This would make the API search for documents containing both “foo” and “bar” in any of the fields, and “baz” in the title field.
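
Query strings of this shape can be assembled programmatically. The helper below is a hypothetical sketch, not part of the client; it joins free-text terms and field:value pairs with AND, and wraps values containing whitespace in double quotes, following the author:"Miles Johnson" form used in the basic usage example.

```php
// Hypothetical helper: joins free-text terms and field:value pairs with AND.
// Values containing whitespace are wrapped in double quotes.
function buildSearchQuery(array $terms, array $fields = [])
{
    $parts = $terms;
    foreach ($fields as $field => $value) {
        $needsQuotes = preg_match('/\s/', $value) === 1;
        $parts[] = $field . ':' . ($needsQuotes ? '"' . $value . '"' : $value);
    }
    return implode(' AND ', $parts);
}

echo buildSearchQuery(['foo', 'bar'], ['title' => 'baz']);
// foo AND bar AND title:baz
```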

Swader\Diffbot\Api\Search::setCol($col = null)
Parameters:
  • $col (string) – [Optional] Name of collection to search
Returns:

$this

If collection name is not provided, Search API will search all the collections under the currently active token.

Swader\Diffbot\Api\Search::setNum($num = 20)
Parameters:
  • $num (string|int) – Number of results to return
Returns:

$this

The $num param should either be a number, or the string “all” if you want the API to return all the results. Note that this may be quite a large payload if the search terms are broad, and you’d likely be better off paginating the result (see below).

Swader\Diffbot\Api\Search::setStart($start = 0)
Parameters:
  • $start (int) – The starting result number. Used during pagination.
Returns:

$this

Swader\Diffbot\Api\Search::buildUrl()
Returns:string

This method is called automatically when Swader\Diffbot\Abstracts\Api::call is called. It builds the URL which is to be called by the HTTPClient in Swader\Diffbot\Diffbot::setHttpClient, and returns it. This method can be used to get the URL for the purposes of testing in third party API clients like Postman.

Usage:

// ... set up the API instance in $api ...
$myUrl = $api->buildUrl();

Swader\Diffbot\Api\Search::call($info = false)
Parameters:
  • $info (bool) – Either true or false
Returns:

Swader\Diffbot\Entity\SearchInfo | Swader\Diffbot\Entity\EntityIterator

When the API instance has been fully configured, this method executes the call.

If the $info parameter passed into the method is false, the return value will be an iterable collection (Swader\Diffbot\Entity\EntityIterator) of appropriate entities. Refer to each API’s documentation for details on entities returned from each API call.

If you pass in true, you force info mode and get back a Swader\Diffbot\Entity\SearchInfo object describing the last call. If call(true) is invoked before any regular call(), the regular call is executed implicitly first, and the SearchInfo for it is then returned.

So:

$searchApi->call(); // gets entities
$searchApi->call(true); // gets SearchInfo about the executed query

SearchInfo Entity Class

When the Search API is called with info mode forced, the API will return an info object, containing various properties useful for pagination and metadata.

class Swader\Diffbot\Entity\SearchInfo
Swader\Diffbot\Entity\SearchInfo::getType()
Returns:string

Will always return “searchInfo”:

// ... API setup ... //
$result = $api->call(true);

echo $result->getType(); // "searchInfo"
Swader\Diffbot\Entity\SearchInfo::getCurrentTimeUTC()
Returns:int

Current UTC time as timestamp

Swader\Diffbot\Entity\SearchInfo::getResponseTimeMS()
Returns:int

Response time in milliseconds. Time it took to process the query on Diffbot’s end.

Swader\Diffbot\Entity\SearchInfo::getNumResultsOmitted()
Returns:int

Number of results skipped for any reason

Swader\Diffbot\Entity\SearchInfo::getNumShardsSkipped()
Returns:int

Number of skipped shards (@todo find out what those are)

Swader\Diffbot\Entity\SearchInfo::getTotalShards()
Returns:int

Total number of shards (@todo find out what those are)

Swader\Diffbot\Entity\SearchInfo::getDocsInCollection()
Returns:int

Total number of documents in collection. Should resemble the total number you got on the crawl job. (@todo: find out why not identical)

Swader\Diffbot\Entity\SearchInfo::getHits()
Returns:int

Number of results that match - NOT the number of returned results! Use this for pagination as a total result count.
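
The pagination arithmetic can be sketched as follows. pageOffsets is a hypothetical helper; it only computes the start offsets for a given hit count and page size.

```php
// Hypothetical helper: lists the start offsets needed to page through
// $hits results, $perPage at a time.
function pageOffsets($hits, $perPage = 20)
{
    if ($perPage < 1) {
        throw new InvalidArgumentException('Page size must be at least 1.');
    }
    $offsets = [];
    for ($start = 0; $start < $hits; $start += $perPage) {
        $offsets[] = $start;
    }
    return $offsets;
}

print_r(pageOffsets(50, 20)); // offsets 0, 20 and 40
```

Each offset can then be handed to setStart(), together with setNum($perPage), before the next call().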

Swader\Diffbot\Entity\SearchInfo::getQueryInfo()
Returns:array

Returns an assoc. array containing the following keys and example values:

/**

"fullQuery" => "type:json AND (author:\"Miles Johnson\" AND type:article)",
"queryLanguageAbbr" => "xx",
"queryLanguage" => "Unknown",
"terms" => [
    [
        "termNum" => 0,
        "termStr" => "Miles Johnson",
        "termFreq" => 2621376,
        "termHash48" => 224575481707228,
        "termHash64" => 4150001371756911641,
        "prefixHash64" => 3732660069076179349
    ],
    [
        "termNum" => 1,
        "termStr" => "type:json",
        "termFreq" => 2621664,
        "termHash48" => 272064464231140,
        "termHash64" => 9877301297136722857,
        "prefixHash64" => 7586288672657224048
    ],
    [
        "termNum" => 2,
        "termStr" => "type:article",
        "termFreq" => 524448,
        "termHash48" => 210861560163398,
        "termHash64" => 12449358332005671483,
        "prefixHash64" => 7586288672657224048
    ]
]

**/

@todo: find out what hashes are, and to what the freq is relative

EntityFactory

The EntityFactory builds the Swader\Diffbot\Entity\EntityIterator by providing it with a collection of entities returned by an API, and the Guzzle Response to consume. It implements the Swader\Diffbot\Interfaces\EntityFactory interface.

The only reason to build your own version of the EntityFactory is to provide it with instructions on how to pair API return types and entities you developed by extending Swader\Diffbot\Abstracts\Entity.

For a concrete example of this, see this tutorial on SitePoint, which demonstrates custom “AuthorFolio” and “SitePointArticle” entities automatically created by calls to a custom API.

class Swader\Diffbot\Factory\Entity

Swader\Diffbot\Factory\Entity::createAppropriateIterator($response)
Parameters:
  • $response (GuzzleHttp\Message\ResponseInterface) – The Guzzle response given by the Guzzle client after an API call’s execution
Returns:

Swader\Diffbot\Entity\EntityIterator

createAppropriateIterator is the only publicly accessible method, and the only one you need to implement when building your own EntityFactory. It takes the Guzzle response provided to it and builds an EntityIterator - a collection of entities appropriate for the API’s result.

EntityIterator

The EntityIterator is a collection object containing the appropriate entities (Swader\Diffbot\Abstracts\Entity) of each API.

For example, executing a Product API call on a URL with a product will actually return an EntityIterator instance with a single element instance - a Swader\Diffbot\Entity\Product. However, the EntityIterator also serves as a proxy to its first element, so accessing a property or a getter on the EntityIterator directly, will in fact access it on the first element. This allows for less verbose constructs. Compare:

$result = $api->call();
echo $result->getTitle();

And:

$result = $api->call();
foreach ($result as $entity) {
    echo $entity->getTitle();
}

Assuming we called the Product API, the above snippets are identical logically, because the Product API only returns a single Product entity.

As evident above, the EntityIterator also acts as an array, and thus can be fully iterated through for when APIs return sets of entities rather than just one (see Swader\Diffbot\Api\Image).
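
The proxy idea can be illustrated with a simplified stand-in class. This is not the client's EntityIterator, only a sketch of the pattern under the assumption that unknown method calls are forwarded to the first element; the class names FirstElementProxy and DemoEntity are made up, and array-style access is left out for brevity.

```php
// Simplified stand-in for the behaviour described above; this is not the
// client's EntityIterator, only an illustration of the proxy idea.
class FirstElementProxy implements IteratorAggregate, Countable
{
    private $items;

    public function __construct(array $items)
    {
        $this->items = array_values($items);
    }

    // Lets foreach walk over every contained entity.
    public function getIterator(): Traversable
    {
        return new ArrayIterator($this->items);
    }

    public function count(): int
    {
        return count($this->items);
    }

    // Unknown method calls are forwarded to the first entity.
    public function __call($name, array $arguments)
    {
        if (!$this->items) {
            throw new BadMethodCallException('The collection is empty.');
        }
        return call_user_func_array([$this->items[0], $name], $arguments);
    }
}

// Made-up entity with a single getter, used only for the demonstration.
class DemoEntity
{
    private $title;

    public function __construct($title)
    {
        $this->title = $title;
    }

    public function getTitle()
    {
        return $this->title;
    }
}

$result = new FirstElementProxy([new DemoEntity('Lamp')]);
echo $result->getTitle(), "\n";    // forwarded to the first element
foreach ($result as $entity) {
    echo $entity->getTitle(), "\n"; // same entity, reached by iteration
}
```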

class Swader\Diffbot\Entity\EntityIterator

Swader\Diffbot\Entity\EntityIterator::__construct(array $objects, $response)
Parameters:
  • $objects (array) – An array of entities returned by the API
  • $response (GuzzleHttp\Message\ResponseInterface) – The original Response object returned by the API, useful for getting raw data if you need to additionally process results

The EntityIterator is automatically constructed by Swader\Diffbot\Factory\Entity - you’ll almost never need to instantiate it yourself. It needs an array of objects, which is an array of Entities (Swader\Diffbot\Abstracts\Entity) through which one can then iterate when processing results, and a Guzzle Response object which one can use to process the raw return data. See below.

Swader\Diffbot\Entity\EntityIterator::getResponse()
Returns:GuzzleHttp\Message\ResponseInterface

Returns the original Guzzle Response object returned by the Guzzle Client after the API call:

$result = $api->call();
var_dump($result->getResponse()->json());

Exceptions

This document contains the descriptions and throw cases for all exceptions in the client. Use this reference when you’re unsure why you may have gotten an exception.

DiffbotException

exception Swader\Diffbot\Exceptions\DiffbotException

The DiffbotException is an empty exception class that extends the base PHP \Exception. It is the base for all other Diffbot exceptions - though currently, it is the only one.

Its main current purpose is to let the user know that something went wrong with the client, not its dependencies or the application consuming it.
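
In practice this means client problems can be caught separately from everything else. A minimal sketch, assuming $api is any configured API instance; callOrNull is a hypothetical wrapper, and returning null is just one possible way to react.

```php
// Hypothetical wrapper: runs an API call and returns null when the client
// itself reports a problem. Any other exception type is left to bubble up.
function callOrNull($api)
{
    try {
        return $api->call();
    } catch (\Swader\Diffbot\Exceptions\DiffbotException $e) {
        return null;
    }
}
```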

Interfaces

This document contains the descriptions for all interfaces used in the client.

Api

interface Swader\Diffbot\Interfaces\Api

The API interface is there as a contract for developing custom APIs, not unlike the Swader\Diffbot\Api\Custom class.

Swader\Diffbot\Interfaces\Api::setTimeout($timeout = 30000)
Parameters:
  • $timeout (int) – The timeout value in milliseconds. Defaults to 30000 (30 seconds)
Returns:

$this

All Diffbot API endpoints support a timeout parameter which tells them after how many milliseconds to stop expecting a response from the page being processed.

Swader\Diffbot\Interfaces\Api::call()
Returns:Swader\Diffbot\Entity\EntityIterator

The call method should execute the remote call to the API. It must return an instance of Swader\Diffbot\Entity\EntityIterator containing the set of appropriate entities for the return value of said API. In custom APIs, these are usually Swader\Diffbot\Entity\Wildcard entities, unless otherwise specified via a custom implementation of Swader\Diffbot\Interfaces\EntityFactory.

Swader\Diffbot\Interfaces\Api::buildUrl()
Returns:string

This method is called automatically when Swader\Diffbot\Interfaces\Api::call is called. It builds the URL which is to be called by the HTTPClient in Swader\Diffbot\Diffbot::setHttpClient, and returns it. This method can be used to get the URL for the purposes of testing in third party API clients like Postman.

EntityFactory

interface Swader\Diffbot\Interfaces\EntityFactory

The EntityFactory interface is there as a contract for developing custom Entity Factories. For example, you may want to make sure that a call to an API returns specific entities rather than Swader\Diffbot\Entity\Wildcard, or some of the predefined ones like Swader\Diffbot\Entity\Product.

A specific example would be having a custom API which processes a site with board game cards. Each card has a specific value at a specific location, and these values may correspond. Rather than manually process data in Swader\Diffbot\Entity\Wildcard entities after a call to this custom API, you might want to define a GameCard entity and give it fields and methods specific to the context. A custom entity factory is then used to bind the newly defined entity with the custom API.

Swader\Diffbot\Interfaces\EntityFactory::createAppropriateIterator($response)
Parameters:
  • $response (GuzzleHttp\Message\ResponseInterface) – The response received from the API call. Must be of the GuzzleHttp v5 type. Automatic if the Guzzle client is used, but version 5 only.
Returns:

Swader\Diffbot\Entity\EntityIterator

Returns the entity iterator containing the appropriate entities as built by the contents of $response.

Traits

All the traits used in the Diffbot PHP client are described in this one document.

DiffbotAware

trait Swader\Diffbot\Traits\DiffbotAware

The DiffbotAware trait is there to make the API classes spawned by Diffbot aware of their parent, so that common configuration values and other factories can be accessed even after an API class has been instantiated.

Unless you’re implementing your own API class which doesn’t extend the \Swader\Diffbot\Abstracts\Api, you won’t need this.

Swader\Diffbot\Traits\DiffbotAware::registerDiffbot($d)
Parameters:
  • $d (\Swader\Diffbot\Diffbot) – Swader\Diffbot\Diffbot - an instance of the Diffbot main class to inject into children, like instances of various API classes.
Returns:

$this

StandardApi

trait Swader\Diffbot\Traits\StandardApi

The StandardApi trait contains some methods common to most, if not all, API classes. These methods are setters for fields which appear in every Diffbot API: links, breadcrumb, meta and querystring. More information available under optional fields in various API doc files.

Swader\Diffbot\Traits\StandardApi::setBreadcrumb($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

Enables or disables the breadcrumb optional field. When enabled, the API returns a top-level array (breadcrumb) of URLs and link text from page breadcrumbs.

Swader\Diffbot\Traits\StandardApi::setQuerystring($bool)
Parameters:
  • $bool (bool) – Either true or false
Returns:

$this

Enables or disables the querystring optional field. When enabled, the API returns any key/value pairs present in the URL querystring. Items without a discrete value will be returned as true.

StandardEntity

trait Swader\Diffbot\Traits\StandardEntity

The StandardEntity trait is here to add some common methods to the various entities. These make sense only in the standard entities, i.e. the data formats returned by Diffbot, which is why they aren’t in the abstract \Swader\Diffbot\Abstracts\Entity class. You probably won’t need this trait unless you define a \Swader\Diffbot\Api\Custom API which offers fields of the same names as those returned by the getters in this trait.

Swader\Diffbot\Traits\StandardEntity::getLang()
Returns:string

Returns the language code of the detected language of the processed content. The code returned is a two-character ISO 639-1 code: http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes

Swader\Diffbot\Traits\StandardEntity::getHumanLanguage()
Returns:string

Alias method for getLang() above.

Swader\Diffbot\Traits\StandardEntity::getPageUrl()
Returns:string

Returns the URL which was processed

Swader\Diffbot\Traits\StandardEntity::getResolvedPageUrl()
Returns:string

Returns the page URL as resolved after following redirects, if any. Will often be identical to the result of getPageUrl above.

Swader\Diffbot\Traits\StandardEntity::getTitle()
Returns:string

Returns the title of the document which was processed.

Swader\Diffbot\Traits\StandardEntity::getMeta()
Returns:array | null

Returns an array containing the full contents of page meta tags, including sub-arrays for OpenGraph tags, Twitter Card metadata, schema.org microdata, and – if available – oEmbed metadata. If the Swader\Diffbot\Traits\StandardApi::setMeta method was not called, will return null.

Swader\Diffbot\Traits\StandardEntity::getBreadcrumb()
Returns:array | null

Returns a top-level array (breadcrumb) of URLs and link text from page breadcrumbs. If the Swader\Diffbot\Traits\StandardApi::setBreadcrumb method was not called, will return null.

Swader\Diffbot\Traits\StandardEntity::getQueryString()
Returns:array | null

Returns any key/value pairs present in the URL querystring. Items without a discrete value will be returned as true. If the Swader\Diffbot\Traits\StandardApi::setQuerystring method was not called, will return null.
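
The shape of the returned array can be illustrated locally. The sketch below is an assumption about the resulting structure, not Diffbot's server-side logic; queryStringPairs is a hypothetical helper that maps valueless items to true, as described above.

```php
// Hypothetical helper: splits the querystring of $url into key/value pairs.
// Items without a discrete value (no "=") are returned as true.
function queryStringPairs($url)
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query) || $query === '') {
        return [];
    }

    $pairs = [];
    foreach (explode('&', $query) as $part) {
        if ($part === '') {
            continue;
        }
        $bits = explode('=', $part, 2);
        $key = urldecode($bits[0]);
        $pairs[$key] = count($bits) === 2 ? urldecode($bits[1]) : true;
    }
    return $pairs;
}

var_dump(queryStringPairs('http://example.com/page?id=42&debug'));
```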

Swader\Diffbot\Traits\StandardEntity::getDiffbotUri()
Returns:string

A unique identifier of the entity in Diffbot’s database. Useful for filtering out duplicates, caching, etc.
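
Duplicate filtering along these lines might look as follows. uniqueByDiffbotUri is a hypothetical helper; it accepts any iterable of entities exposing getDiffbotUri() and keeps the first entity seen for each identifier.

```php
// Hypothetical helper: keeps only the first entity for each Diffbot URI.
// $entities may be an array or any other iterable collection of entities.
function uniqueByDiffbotUri($entities)
{
    $seen = [];
    foreach ($entities as $entity) {
        $uri = $entity->getDiffbotUri();
        if (!isset($seen[$uri])) {
            $seen[$uri] = $entity;
        }
    }
    // Re-index so the result is a plain list.
    return array_values($seen);
}
```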