Diffbot PHP Client Documentation¶
This documentation is intended for Diffbot’s PHP client - for general Diffbot docs, see Diffbot Documentation.
If this is your first time encountering the PHP client, it’s recommended you read the overview.
Overview¶
Diffbot¶
Diffbot is a visual machine learning AI which processes renders of web pages to generate structured JSON entities.
In other words, you give Diffbot a URL and it returns human-readable data about the page. It doesn’t rely on what it finds in the source code; rather, it reads the rendered page the way a human would, visually extracting the human-directed content to provide reliable information about what’s actually being said on the page. As a result, it is relatively hard to trick with over-optimized SEO meta content.
Diffbot exposes its services via a set of API endpoints.
To read more about Diffbot, see the official documentation, or some of the following tutorials:
Diffbot PHP Client¶
The Diffbot PHP Client is the official PHP wrapper for the API endpoints Diffbot provides.
By using the PHP client, the developer can interact with both the APIs and the returned entities in an object oriented manner, rather than parse JSON and extract data manually. The PHP client uses Guzzle to issue requests to the API. It is currently built on top of Guzzle 5, and there are no immediate plans to transition to the feature-lacking version 6.
Quickstart¶
Install via Composer:
composer require swader/diffbot-php-client
Create a Diffbot instance, provide a token, specify the URL you want to process, and use all this to create an instance of the API endpoint:
$diffbot = new Diffbot('my_token');
$url = 'http://www.sitepoint.com/diffbot-crawling-visual-machine-learning/';
$articleApi = $diffbot->createArticleAPI($url);
Configure the API call with some setters (all will be explained in this documentation) and issue the call:
$processedArticle = $articleApi->setDiscussion(false)->call();
Consume the resulting data entity any way you see fit:
echo $processedArticle->author; // Bruno Skvorc
Quicklinks¶
Here’s a list of sub-guides for this PHP client which may be useful depending on your specific needs:
- Products API - if you need to parse online products like webshop content, auction site pages, etc.
- Articles API - if you need to parse online posts like news sites, blogs, tutorials, and other prose.
- Discussion API - for parsing forum topics, comment threads, and other back-and-forth forms of communication.
- Analyze API - If you don’t know what you’re parsing, and want to rely on Diffbot’s intuition to figure it out and auto-apply the correct API (one of the above).
- Image API - if you’re planning to parse an image-heavy site and want them all returned, along with extra data. Think galleries, pinterest pages, instagram...
- Custom API - if you built your own API on Diffbot and want to use it with the client. Works well with EntityFactory.
- Crawl API - if you want to apply any of the above on a massive number of URLs at once.
- Search API - if you want to search the results produced by running the Crawl API above.
Whatever your goal, make sure you read the main Diffbot file first.
Diffbot Class¶
The Diffbot class is the first instance a developer must create when using the client. It serves as a container for global settings, and as a factory for the various API endpoint classes.
class Swader\Diffbot\Diffbot¶
The Diffbot class takes a single optional argument, the $token, which can be obtained here. Instantiate like so:
$diffbot = new Diffbot("my_token");
Alternatively, set the token globally, and instantiate without passing in the parameter:
Diffbot::setToken("my_token");
$diffbot = new Diffbot();
Note that if you instantiate without a global token set, and don’t pass in a token while instantiating either, you’ll get a Swader\Diffbot\Exceptions\DiffbotException thrown.
setToken¶
static Swader\Diffbot\Diffbot::setToken($token)¶
Parameters:
- $token (string) – The token.
Returns: void, or throws an \InvalidArgumentException if the token is invalid
Useful for setting a default token for all future instances.
Usage:
Diffbot::setToken("my_token");
getToken¶
Swader\Diffbot\Diffbot::getToken()¶
Returns: string | null
Returns either the instance token or the globally defined one, or null if neither is defined.
Usage:
echo $diffbot->getToken(); // "my_token"
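The token fallback described above (instance token first, then the global one) can be sketched in isolation. This is a minimal illustration, not the client's actual source: the class name TokenHolder is hypothetical, the validity rule (a non-empty string) is an assumption, and a plain RuntimeException stands in for the client's DiffbotException.

```php
<?php
// Hypothetical stand-in for the client's token handling; not the library's code.
final class TokenHolder
{
    /** @var string|null Default token shared by all future instances. */
    private static $globalToken = null;

    /** @var string|null Token passed to this particular instance. */
    private $instanceToken;

    public static function setToken($token)
    {
        // Assumption: a "valid" token is simply a non-empty string.
        if (!is_string($token) || $token === '') {
            throw new \InvalidArgumentException('Token must be a non-empty string.');
        }
        self::$globalToken = $token;
    }

    public function __construct($token = null)
    {
        if ($token === null && self::$globalToken === null) {
            // The real client throws Swader\Diffbot\Exceptions\DiffbotException here.
            throw new \RuntimeException('No token provided and no global token set.');
        }
        $this->instanceToken = $token;
    }

    public function getToken()
    {
        // Prefer the instance token, fall back to the global one.
        return $this->instanceToken !== null ? $this->instanceToken : self::$globalToken;
    }
}

TokenHolder::setToken('global_token');
$usesGlobal = new TokenHolder();
$usesOwn = new TokenHolder('instance_token');
```

Here $usesGlobal reports the global token while $usesOwn reports its own, which mirrors the behaviour the getToken description implies.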
setHttpClient¶
Swader\Diffbot\Diffbot::setHttpClient(GuzzleHttp\Client $client)¶
Parameters:
- $client (GuzzleHttp\Client) – The HTTP client.
Returns: $this
Allows changing the HTTP client used to send requests to the Diffbot API. Generally useful only during testing, but some edge cases may arise. This method does not need to be called for Diffbot to be usable; it will default to a new instance of the regular GuzzleHttp\Client.
Usage:
$client = new GuzzleHttp\Client();
$diffbot->setHttpClient($client);
getHttpClient¶
Swader\Diffbot\Diffbot::getHttpClient()¶
Returns: GuzzleHttp\Client
Returns the currently set HTTP client. Can be changed via Swader\Diffbot\Diffbot::setHttpClient.
setEntityFactory¶
Swader\Diffbot\Diffbot::setEntityFactory($factory)¶
Parameters:
- $factory (Swader\Diffbot\Interfaces\EntityFactory) – A Swader\Diffbot\Interfaces\EntityFactory implementation.
Returns: $this
Allows for changing the entity factory in use when returning and processing Diffbot-provided data. A custom Entity Factory might, for example, return Author entities (also custom) for all calls to a custom API set up in a user’s Diffbot account. This helps with getting fully consumable custom data right from the API source, rather than requiring additional processing.
If not explicitly set, defaults to the built-in Swader\Diffbot\Factory\Entity.
Usage:
$newEntityFactory = new \My\Custom\EntityFactory();
$diffbot = new Diffbot('my_token');
$diffbot->setEntityFactory($newEntityFactory);
// @todo: Full tutorial about a custom Entity and EntityFactory
getEntityFactory¶
Swader\Diffbot\Diffbot::getEntityFactory()¶
Returns: Swader\Diffbot\Interfaces\EntityFactory
Returns the currently defined Swader\Diffbot\Interfaces\EntityFactory instance. This method generally isn’t needed outside of testing scenarios. See above for usage of the setter.
createProductApi¶
Swader\Diffbot\Diffbot::createProductApi($url)¶
Parameters:
- $url (string) – URL which is to be processed, or the word “crawl”
Returns: Swader\Diffbot\Api\Product
The product API turns web shops, catalogs, etc. into structured JSON (think eBay, Amazon...). This method creates an instance of the Swader\Diffbot\Api\Product class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). For a detailed directory of available methods and in-depth usage examples, see the Swader\Diffbot\Api\Product documentation.
Usage:
$api = $diffbot->createProductApi("http://www.amazon.com/Oh-The-Places-Youll-Go/dp/0679805273/");
$result = $api->call();
echo $result->offerPrice; // $11.99
echo $result->getIsbn(); // 0679805273
createArticleApi¶
Swader\Diffbot\Diffbot::createArticleApi($url)¶
Parameters:
- $url (string) – URL which is to be processed, or the word “crawl”
Returns: Swader\Diffbot\Api\Article
The article API turns online news posts, blog articles, etc. into structured JSON. This method creates an instance of the Swader\Diffbot\Api\Article class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). For a detailed directory of available methods and in-depth usage examples, see the Swader\Diffbot\Api\Article documentation.
Usage:
$api = $diffbot->createArticleApi("http://techcrunch.com/2012/05/31/diffbot-raises-2-million-seed-round-for-web-content-extraction-technology/");
$result = $api->call();
echo $result->publisherCountry; // United States
echo $result->getAuthor(); // Sarah Perez
createImageApi¶
Swader\Diffbot\Diffbot::createImageApi($url)¶
Parameters:
- $url (string) – URL which is to be processed, or the word “crawl”
Returns: Swader\Diffbot\Api\Image
The image API finds images in a post and returns them as JSON. This method creates an instance of the Swader\Diffbot\Api\Image class. The method accepts a single string as a parameter: either a URL to process for images, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). For a detailed directory of available methods and in-depth usage examples, see the Swader\Diffbot\Api\Image documentation. Note that unlike Product and Article, the Image API can return several Image entities (see usage below). If not iterated through, the result refers to the first image only.
Usage:
$api = $diffbot->createImageApi("http://smittenkitchen.com/blog/2012/01/buckwheat-baby-with-salted-caramel-syrup/");
$result = $api->call();
echo $result->naturalHeight; // 333
foreach ($result as $image) {
    echo $image->title;
    echo $image->getXPath();
}
createAnalyzeApi¶
Swader\Diffbot\Diffbot::createAnalyzeApi($url)¶
Parameters:
- $url (string) – URL which is to be processed, or the word “crawl”
Returns: Swader\Diffbot\Api\Analyze
The analyze API tries to autodetect the content it’s dealing with (image, product, article...) and extracts it into structured JSON. This method creates an instance of the Swader\Diffbot\Api\Analyze class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). The Analyze API is the default API used during Swader\Diffbot\Diffbot::crawl mode.
Usage:
$api = $diffbot->createAnalyzeApi("http://techcrunch.com/2012/05/31/diffbot-raises-2-million-seed-round-for-web-content-extraction-technology/");
$result = $api->call();
echo $result->publisherCountry; // United States
echo $result->getAuthor(); // Sarah Perez
createDiscussionApi¶
Swader\Diffbot\Diffbot::createDiscussionApi($url)¶
Parameters:
- $url (string) – URL which is to be processed, or the word “crawl”
Returns: Swader\Diffbot\Api\Discussion
The discussion API turns online comments, forum topics or pages of reviews into structured JSON. Think Amazon review section, YouTube comments, article Disqus comments, etc. This method creates an instance of the Swader\Diffbot\Api\Discussion class. The method accepts a single string as a parameter: either a URL to process, or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below). Like the Image API above, this one also returns several Swader\Diffbot\Api\Discussion entities per call, if available, along with other data - see usage below.
Usage:
$api = $diffbot->createDiscussionApi("http://boards.straightdope.com/sdmb/showthread.php?t=740315");
$result = $api->call();
echo $result->numPosts; // 43
echo $result->getParticipants(); // 23
foreach ($result as $post) {
    echo $post->getAuthor();
    echo $post->votes;
}
createCustomApi¶
Swader\Diffbot\Diffbot::createCustomApi($url, $name)¶
Parameters:
- $url (string) – URL which is to be processed, or the word “crawl”
- $name (string) – Name of the custom API as defined in the Diffbot UI
Returns: Swader\Diffbot\Api\Custom
Diffbot customers can define Custom APIs. For a tutorial on doing this, see here. What it comes down to is that you can tell Diffbot how to recognize certain areas of a web page, and have it translate that into JSON for you if none of the standard APIs do the trick. This allows for much more lightweight and specific calls, resulting in a quicker turnaround and (usually) more precise data. This method creates an instance of the Swader\Diffbot\Api\Custom class. The method accepts two parameters: first, either a URL to process or the word “crawl” if used in conjunction with the Swader\Diffbot\Diffbot::crawl method (see below); second, the name of the custom API to use. Unlike other APIs, this one has no specific entity to return and instead returns a Swader\Diffbot\Entity\Wildcard entity which matches anything.
Usage:
$api = $diffbot->createCustomApi("http://sitepoint.com/author/bskvorc", "AuthorFolio");
$result = $api->call();
echo $result->bio; // Bruno is a coder from Croatia with Master's Degrees in...
crawl¶
Swader\Diffbot\Diffbot::crawl($name = null, Swader\Diffbot\Api $api = null)¶
Parameters:
- $name (string) – Name of the new crawljob. If omitted, activates read-only mode and returns joint data about all defined crawljobs for the current Diffbot token.
- $api (Swader\Diffbot\Api) – Instance of the API to process the crawled URLs. If omitted, defaults to Swader\Diffbot\Api\Analyze.
Returns: Swader\Diffbot\Api\Crawl
The crawl method is used to create a new Crawlbot job (crawljob). To find out more about Crawlbot and what, how and why it does what it does, see here. I also recommend reading the Crawlbot API docs and the Crawlbot support topics just so you can dive right in without being too confused by the code below.
In a nutshell, the Crawlbot crawls a set of seed URLs for links (even if a subdomain is passed to it as seed URL, it still looks through the entire main domain and all other subdomains it can find) and then processes all the pages it can find using the API you define (or the Analyze API by default). The result of the call is a collection of Swader\Diffbot\Entity\JobCrawl objects, each with details about a defined job. To actually get data obtained by crawling and processing, use the Swader\Diffbot\Diffbot::search API.
Here’s how you can create a crawljob (see the detailed Swader\Diffbot\Api\Search documentation for a step-by-step guide with explanations):
$url = 'crawl';
$articleApi = $diffbot->createArticleAPI($url)->setDiscussion(false);
$crawl = $diffbot->crawl('mycrawl_01', $articleApi);
$crawl->setSeeds(['http://sitepoint.com']);
$job = $crawl->call();
// See JobCrawl class to find out which getters are available
dump($job->getDownloadUrl("json")); // outputs download URL to JSON dataset of the job's result
search¶
Swader\Diffbot\Diffbot::search($q)¶
Parameters:
- $q (string) – The query to execute against the Search API
Returns: Swader\Diffbot\Api\Search
The Search API is used to search through sets of crawled and processed data obtained through the use of the Crawl or Bulk API. It accepts a simple string query, and returns an array of all matching entities. For a live example of a crawl + search implementation, see here, and for a full walkthrough of the Search API, see the Swader\Diffbot\Api\Search docs.
Usage:
$search = $diffbot->search('author:"Miles Johnson" AND type:article');
$result = $search->call();
foreach ($result as $article) {
    echo $article->getTitle();
}
API Abstract¶
This page will describe the API Abstract class - the one which all the API classes extend to get some common functionality. Use this to build your own API class for custom APIs you defined in the Diffbot UI.
class Swader\Diffbot\Abstracts\Api¶
__construct¶
Swader\Diffbot\Abstracts\Api::__construct($url)¶
Parameters:
- $url (string) – The URL of the page to process
Throws: InvalidArgumentException if the URL is invalid AND not the word “crawl”
This class takes a single argument during construction, the URL of the page to process. Alternatively, the argument can be “crawl”, if the API is to be used in conjunction with the Swader\Diffbot\Api\Crawl API.
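The constructor rule above (accept a valid URL or the literal word “crawl”, otherwise throw) might look roughly like the following. This is a sketch under assumptions: the function name assertProcessableUrl is hypothetical, and using PHP's FILTER_VALIDATE_URL as the validity test is a guess at how the check could be done, not a statement about the client's internals.

```php
<?php
// Hypothetical helper; the client's real validation may differ.
function assertProcessableUrl($url)
{
    // The special value "crawl" is allowed for use with the Crawl API.
    if ($url === 'crawl') {
        return $url;
    }
    // Assumption: validity is decided by PHP's built-in URL filter,
    // which requires a scheme such as http://.
    if (!is_string($url) || filter_var($url, FILTER_VALIDATE_URL) === false) {
        throw new \InvalidArgumentException('Invalid URL: ' . var_export($url, true));
    }
    return $url;
}

$ok = assertProcessableUrl('http://www.sitepoint.com/diffbot-crawling-visual-machine-learning/');
$crawlMode = assertProcessableUrl('crawl');
```

Under this assumption, a URL without a scheme (for example one starting with www.) would be rejected, so it is safest to always pass fully qualified URLs.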
setTimeout¶
Swader\Diffbot\Abstracts\Api::setTimeout($timeout = 30000)¶
Parameters:
- $timeout (int) – Optional. The timeout, in milliseconds. Defaults to 30,000, a.k.a. 30 seconds
Returns: $this
Throws: InvalidArgumentException if the timeout value is invalid (negative or not an integer)
Setting the timeout will define how long Diffbot will keep trying to fetch the API results. A timeout can happen for various reasons, from Diffbot’s failure, to the site being crawled being exceptionally slow, and more.
Usage:
$api->setTimeout(40000);
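The validation rule stated above (reject negative or non-integer values, default to 30,000 ms) can be illustrated with a small standalone check. The function name normalizeTimeout is hypothetical and this is only a sketch of the documented rule, not the client's code.

```php
<?php
// Hypothetical helper mirroring the documented rule for timeouts in milliseconds.
function normalizeTimeout($timeout = 30000)
{
    // Reject anything that is not an integer, and any negative integer.
    if (!is_int($timeout) || $timeout < 0) {
        throw new \InvalidArgumentException('Timeout must be a non-negative integer (milliseconds).');
    }
    return $timeout;
}

$default = normalizeTimeout();      // falls back to the 30-second default
$custom = normalizeTimeout(40000);  // 40 seconds
```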
call¶
Swader\Diffbot\Abstracts\Api::call()¶
Returns: Swader\Diffbot\Entity\EntityIterator
When the API instance has been fully configured, this method executes the call. The return value will be an iterable collection of appropriate entities. Refer to each API’s documentation for details on entities returned from each API call.
Usage:
$result = $api->call();
foreach ($result as $entity) { /* ... */ }
buildUrl¶
Swader\Diffbot\Abstracts\Api::buildUrl()¶
Returns: string
This method is called automatically when Swader\Diffbot\Abstracts\Api::call is called. It builds the URL which is to be called by the HTTP client set in Swader\Diffbot\Diffbot::setHttpClient, and returns it. This method can be used to get the URL for the purposes of testing in third party API clients like Postman.
Usage:
// ... set up API
$myUrl = $api->buildUrl();
Entity Abstract¶
This page will describe the Entity Abstract class. This class is the root of all Entity classes. Entity classes are used as containers for return values from various API endpoints. For example, the Article API will return an Article Entity, the Discussion API will return a Discussion Entity, and so on.
It is important to note that an API class will never return an Entity class directly. Rather, it will return a Swader\Diffbot\Entity\EntityIterator, an iterable container with all the Entities inside. The container, however, is configured in such a way that executing get methods on it directly will forward those calls to the first Entity in its dataset. See Swader\Diffbot\Entity\EntityIterator.
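To make the forwarding behaviour concrete, here is a minimal sketch of an iterable container that passes unknown method calls on to its first element. The class names SketchEntity and SketchIterator are hypothetical, and the real EntityIterator may be implemented quite differently; the sketch only demonstrates the idea.

```php
<?php
// Hypothetical classes for illustration; not the client's real implementation.
final class SketchEntity
{
    private $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function getTitle()
    {
        return isset($this->data['title']) ? $this->data['title'] : null;
    }
}

final class SketchIterator implements \IteratorAggregate
{
    private $entities;

    public function __construct(array $entities)
    {
        $this->entities = array_values($entities);
    }

    // Makes the container usable in foreach.
    public function getIterator(): \Iterator
    {
        return new \ArrayIterator($this->entities);
    }

    // Any method not defined on the container is forwarded to the first entity.
    public function __call($name, array $arguments)
    {
        if (empty($this->entities)) {
            throw new \BadMethodCallException('The container holds no entities.');
        }
        return call_user_func_array([$this->entities[0], $name], $arguments);
    }
}

$result = new SketchIterator([
    new SketchEntity(['title' => 'First']),
    new SketchEntity(['title' => 'Second']),
]);

$firstTitle = $result->getTitle(); // forwarded to the first entity
$allTitles = [];
foreach ($result as $entity) {
    $allTitles[] = $entity->getTitle();
}
```

This is why single-entity results (such as an article) can be used as if they were the entity itself, while multi-entity results (such as images) should be iterated.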
class Swader\Diffbot\Abstracts\Entity¶
__construct¶
Swader\Diffbot\Abstracts\Entity::__construct(array $data)¶
Parameters:
- $data (array) – The data
This class takes a single argument during construction, an array of data. This data is then turned into gettable information by means of getters, both direct and magic. Some getters do additional processing of the data in order to make it more useful to the user.
getData¶
Swader\Diffbot\Abstracts\Entity::getData()¶
Returns: array
Returns the raw data passed into the Entity by the parent API class. This will be an associative array (see Usage below).
Usage:
// ...
$data = $article->getData();
echo $data['title'];
echo $data['author'];
// etc.
__call¶
Swader\Diffbot\Abstracts\Entity::__call()¶
Returns: mixed
Throws: BadMethodCallException if the prefix of the method is not get
Magic method for resolving undefined getters, and only getters. If the method being called starts with get, the remainder of its name will be turned into a key to search for inside the $data property (see getData). Once the call is identified as a getter call, __get is invoked (see below).
__get¶
Swader\Diffbot\Abstracts\Entity::__get()¶
Returns: mixed (null if the property is not found)
This method is called automatically when __call is called. It looks for the property being asked for inside the $data property of the current class, or returns null if not found.
Usage:
// ... set up API and call it
echo $result->author;
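The getter resolution described for __call and __get can be sketched as follows. The class SketchMagicEntity is hypothetical, and the way the key is derived from the method name (dropping the get prefix and lowercasing the first letter) is an assumption made for illustration.

```php
<?php
// Hypothetical sketch of magic getter resolution; not the library's source.
class SketchMagicEntity
{
    protected $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }

    public function __call($name, $arguments)
    {
        // Only methods prefixed with "get" are treated as getters.
        if (strpos($name, 'get') !== 0) {
            throw new \BadMethodCallException('Undefined method: ' . $name);
        }
        // Assumption: getTitle() maps to the "title" key.
        $key = lcfirst(substr($name, 3));
        return $this->__get($key);
    }

    public function __get($key)
    {
        // Look the key up in the raw data, or return null when it is absent.
        return array_key_exists($key, $this->data) ? $this->data[$key] : null;
    }
}

$entity = new SketchMagicEntity(['title' => 'Hello', 'author' => 'Bruno Skvorc']);
$viaMethod = $entity->getTitle();     // resolved through __call, then __get
$viaProperty = $entity->author;       // resolved through __get directly
$missing = $entity->getPublisher();   // key not present in the data
```

Both access styles end up in the same lookup, which is why the documentation's examples mix property access and get-prefixed methods freely.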
Article API¶
This API is used to turn content like blog posts, news articles, and other prose into JSON.
For examples of data that might be returned, please see http://diffbot.com and run the Article API demo.
The Article API part of the Diffbot PHP client consists of two main classes: the API class, and the Article Entity class. We’ll describe them in order. Note that the API class extends Swader\Diffbot\Abstracts\Api, so be sure to read that first if you haven’t already.
Article API Class¶
class Swader\Diffbot\Api\Article¶
Basic Usage:
use Swader\Diffbot\Diffbot;
$url = 'http://some-article-to-process.com';
$diffbot = new Diffbot('my_token');
$api = $diffbot->createArticleApi($url);
setSentiment¶
Swader\Diffbot\Api\Article::setSentiment($bool)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
This method sets the sentiment optional field value. This determines whether or not to return the sentiment score of the analyzed article text, a value ranging from -1.0 (very negative) to 1.0 (very positive). Sentiment analysis is powered by Semantria for advanced features like keyword and entity extraction, but the basic sentiment analysis (score only) is enabled for everyone, even those without Semantria accounts.
Usage:
$url = 'http://www.sitepoint.com/diffbot-crawling-visual-machine-learning/';
// ...
$api->setSentiment(true);
$result = $api->call();
// ...
echo $result->sentiment; // -0.0979
setPaging¶
Swader\Diffbot\Api\Article::setPaging($bool = true)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
If set to false, Diffbot will not auto-concatenate several pages of a multi-page article into one. Defaults to true, max 20 pages.
For more info about auto-concatenation, see here.
While practical, this is a less reliable method of concatenating long posts than finding out the number of pages manually and processing them one by one. It often fails to recognize the next-page links, and if the series is longer than 20 parts, everything past the 20th page is ignored. This is a limitation of Diffbot, not the client, and there’s little chance of it changing, since concatenations longer than 20 pages would likely trigger timeouts as the page count grows.
If you need to process multiple pages of something, it is thus recommended that you find those links yourself, then pass them into the Article API one by one and concatenate the results later. If you’d like to analyze the entire concatenated post after the fact, it’s best to concatenate manually and then send the merged content to Diffbot as a POST value for processing.
Usage:
$url = 'http://www.some-seven-part-article.com/';
// ...
$api->setPaging(true);
$result = $api->call();
// ...
echo $result->numPages; // 7
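The manual approach recommended above (process each page separately, then merge) could be sketched like this. The function concatArticlePages and the stubbed fetcher are hypothetical; in real use the callable would wrap an Article API call per URL, and the example.com URLs are placeholders.

```php
<?php
// Hypothetical helper: joins the text of several pages in the order given.
// $fetchText is any callable that takes a URL and returns that page's text (or null).
function concatArticlePages(array $pageUrls, callable $fetchText)
{
    $parts = [];
    foreach ($pageUrls as $url) {
        $text = $fetchText($url);
        // Skip pages for which no text could be extracted.
        if (is_string($text) && trim($text) !== '') {
            $parts[] = trim($text);
        }
    }
    return implode("\n\n", $parts);
}

// Stub standing in for real API calls so the sketch runs without a token.
// With the client, the callable might instead be something like:
//   function ($url) use ($diffbot) {
//       return $diffbot->createArticleApi($url)->setPaging(false)->call()->getText();
//   }
$fakePages = [
    'http://example.com/post?page=1' => 'Part one.',
    'http://example.com/post?page=2' => ' Part two. ',
    'http://example.com/post?page=3' => null,
];
$fetchStub = function ($url) use ($fakePages) {
    return isset($fakePages[$url]) ? $fakePages[$url] : null;
};

$merged = concatArticlePages(array_keys($fakePages), $fetchStub);
```

Passing the fetcher in as a callable keeps the merging logic testable on its own and leaves the choice of API settings to the caller.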
setMaxTags¶
Swader\Diffbot\Api\Article::setMaxTags($max = 5)¶
Parameters:
- $max (int) – The number of tags to generate and return
Returns: $this
Set the maximum number of automatically-generated tags to return. By default a maximum of five tags will be returned. Tags are a built-in feature of Diffbot, and two calls to the same URL could generate different results if enough time has passed between them, because Diffbot’s engine evolves as it processes more and more content.
For an example of what the tags might look like, run the demo example at https://diffbot.com or see Swader\Diffbot\Entity\Article::getTags.
setDiscussion¶
Swader\Diffbot\Api\Article::setDiscussion($bool = true)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
Whether or not to use the Discussion API to additionally process any detected comment or review threads in the article. Behaves as if the Swader\Diffbot\Api\Discussion was set to process the page, and merges the returned data with the Article API’s results by means of a discussion field in the result. The field will have all the sub-fields of the usual Swader\Diffbot\Api\Discussion call; i.e. you will be able to access the Swader\Diffbot\Entity\Discussion entity and all its sub entities via the Swader\Diffbot\Entity\Article::getDiscussion method.
Article Entity Class¶
When the Article API is done processing an article (or several), the result will be an Article Entity (i.e. a collection of one Article Entity inside an instance of Swader\Diffbot\Entity\EntityIterator).
For an overview of the abstract class all Entities build on, see Swader\Diffbot\Abstracts\Entity.
Note that the Article entity can also be returned by the Swader\Diffbot\Api\Analyze API in “article” mode, or in default mode when processing a URL that contains an article (auto-determined).
class Swader\Diffbot\Entity\Article¶
__construct¶
Swader\Diffbot\Entity\Article::__construct(array $data)¶
Parameters:
- $data (array) – The data from which to build the Article entity
The Article entity’s constructor needs the data to populate its properties (see getters below). This class is automatically instantiated after a Swader\Diffbot\Api\Article or Swader\Diffbot\Api\Analyze call. You probably won’t ever need to manually create an instance of this class.
In the case of the Article entity, the constructor differs from the abstract one (Swader\Diffbot\Abstracts\Entity::__construct) in that it also looks for the discussion key in the result, in order to build a Swader\Diffbot\Entity\Discussion sub-entity (see Swader\Diffbot\Entity\Article::getDiscussion).
getType¶
Swader\Diffbot\Entity\Article::getType()¶
Returns: string
Will always return “article” for articles:
// ... API setup ... //
$result = $api->call();
echo $result->getType(); // "article"
getText¶
Swader\Diffbot\Entity\Article::getText()¶
Returns: string | null
Returns the plaintext content of the processed article. HTML tags are stripped completely, images are removed. If the text property is missing in the result, returns null.
getHtml¶
Swader\Diffbot\Entity\Article::getHtml()¶
Returns: string | null
Returns the full HTML content of the article. If the HTML property is missing in the result, returns null.
getDate¶
getAuthor¶
Swader\Diffbot\Entity\Article::getAuthor()¶
Returns: string | null
Returns the name of the author as written on the page. If Diffbot was unable to figure out who the author is, null is returned.
getTags¶
Swader\Diffbot\Entity\Article::getTags()¶
Returns: array
Returns an array of tags/entities, generated from analysis of the extracted text and cross-referenced with DBpedia and other data sources. Note that these are not the meta tags as defined by the author, but machine learned ones:
// ... API setup ... //
// URL: "http://www.sitepoint.com/diffbot-crawling-visual-machine-learning" //
$result = $api->call();
echo count($result->tags); // 5
var_dump($result->tags);
/** Output:
array (size=5)
  0 =>
    array (size=4)
      'count' => int 1
      'score' => float 0.62
      'label' => string 'Machine learning' (length=16)
      'uri' => string 'http://dbpedia.org/resource/Machine_learning' (length=44)
  1 =>
    array (size=4)
      'count' => int 4
      'score' => float 0.61
      'label' => string 'Web crawler' (length=11)
      'uri' => string 'http://dbpedia.org/resource/Web_crawler' (length=39)
  2 =>
    array (size=4)
      'count' => int 4
      'score' => float 0.59
      'label' => string 'Lexical analysis' (length=16)
      'uri' => string 'http://dbpedia.org/resource/Lexical_analysis' (length=44)
  3 =>
    array (size=4)
      'count' => int 7
      'score' => float 0.54
      'label' => string 'Uniform resource locator' (length=24)
      'uri' => string 'http://dbpedia.org/resource/Uniform_resource_locator' (length=52)
  4 =>
    array (size=5)
      'count' => int 2
      'score' => float 0.52
      'label' => string 'JavaScript' (length=10)
      'rdfTypes' =>
        array (size=3)
          0 => string 'http://dbpedia.org/ontology/ProgrammingLanguage' (length=47)
          1 => string 'http://dbpedia.org/ontology/Software' (length=36)
          2 => string 'http://dbpedia.org/ontology/Work' (length=32)
      'uri' => string 'http://dbpedia.org/resource/JavaScript' (length=38)
**/
Returns a maximum of 5 by default, though this can be changed in Swader\Diffbot\Api\Article::setMaxTags.
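Since each tag is a plain associative array with count, score, label and uri keys, post-processing is straightforward. The sketch below keeps only the labels of tags at or above a chosen score; the helper name tagLabelsAboveScore is hypothetical, and the sample data is copied from the output shown above (with the rdfTypes key omitted for brevity).

```php
<?php
// Hypothetical helper working on the tag array shape shown in the output above.
function tagLabelsAboveScore(array $tags, $minScore)
{
    $labels = [];
    foreach ($tags as $tag) {
        // Ignore malformed entries that lack a score or a label.
        if (isset($tag['score'], $tag['label']) && $tag['score'] >= $minScore) {
            $labels[] = $tag['label'];
        }
    }
    return $labels;
}

$tags = [
    ['count' => 1, 'score' => 0.62, 'label' => 'Machine learning', 'uri' => 'http://dbpedia.org/resource/Machine_learning'],
    ['count' => 4, 'score' => 0.61, 'label' => 'Web crawler', 'uri' => 'http://dbpedia.org/resource/Web_crawler'],
    ['count' => 4, 'score' => 0.59, 'label' => 'Lexical analysis', 'uri' => 'http://dbpedia.org/resource/Lexical_analysis'],
    ['count' => 7, 'score' => 0.54, 'label' => 'Uniform resource locator', 'uri' => 'http://dbpedia.org/resource/Uniform_resource_locator'],
    ['count' => 2, 'score' => 0.52, 'label' => 'JavaScript', 'uri' => 'http://dbpedia.org/resource/JavaScript'],
];

$strongTags = tagLabelsAboveScore($tags, 0.6);
```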
getNumPages¶
Swader\Diffbot\Entity\Article::getNumPages()¶
Returns: int
Returns the number of pages if the article is a multi-page one. Read about auto-concatenation here and study the Swader\Diffbot\Api\Article::setPaging method for more details.
getNextPages¶
Swader\Diffbot\Entity\Article::getNextPages()¶
Returns: array
If the article is a multi-page one, returns the list of absolute URLs of the pages that follow the one that was processed. If the article is a single-page one, an empty array is returned.
getSentiment¶
Swader\Diffbot\Entity\Article::getSentiment()¶
Returns: float | null
Returns the sentiment score of the analyzed article text, a value ranging from -1.0 (very negative) to 1.0 (very positive). If the sentiment score is absent (due to Diffbot being unable to determine it, or due to Swader\Diffbot\Api\Article::setSentiment being set to false), returns null.
getDiscussion¶
Swader\Diffbot\Entity\Article::getDiscussion()¶
Returns: Swader\Diffbot\Entity\Discussion | null
Returns the Swader\Diffbot\Entity\Discussion found on the article’s page (comments section). See Swader\Diffbot\Api\Article::setDiscussion for details and below for usage:
use Swader\Diffbot\Diffbot;
$url = "http://www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/";
$diffbot = new Diffbot("my_token");
$api = $diffbot->createArticleApi($url);
$result = $api->call();
echo $result->getDiscussion()->getNumPosts(); // 7
echo $result->getDiscussion()->getProvider(); // Disqus
For other methods exposed on the Swader\Diffbot\Entity\Discussion entity, see its documentation.
getImages¶
Swader\Diffbot\Entity\Article::getImages()¶
Returns: array
An array of images found in the article, with their details. The elements of the array are arrays like this one:
/**
array (size=7)
  'height' => int 512
  'diffbotUri' => string 'image|3|-851701004' (length=18)
  'naturalHeight' => int 727
  'width' => int 749
  'primary' => boolean true
  'naturalWidth' => int 1063
  'url' => string 'http://dab1nmslvvntp.cloudfront.net/wp-content/uploads/2014/07/140624455201.png' (length=79)
**/
Unlike the Swader\Diffbot\Api\Discussion API, which returns details about discussion posts even when used with the Swader\Diffbot\Api\Article API, the image data returned with this method is minimal. For fuller details about images, use the Swader\Diffbot\Api\Image API.
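Given the element shape shown above, picking an article's main image is a matter of looking for the entry flagged as primary. The helper below is a hypothetical sketch; falling back to the first image when none is flagged is a design choice made here, not documented behaviour, and the second sample URL is a placeholder.

```php
<?php
// Hypothetical helper operating on the image array shape shown above.
function primaryImageUrl(array $images)
{
    foreach ($images as $image) {
        // Prefer the image explicitly flagged as primary.
        if (!empty($image['primary']) && isset($image['url'])) {
            return $image['url'];
        }
    }
    // Assumption: fall back to the first image that has a URL, if any.
    foreach ($images as $image) {
        if (isset($image['url'])) {
            return $image['url'];
        }
    }
    return null;
}

$images = [
    ['height' => 100, 'width' => 100, 'primary' => false, 'url' => 'http://example.com/thumb.png'],
    ['height' => 512, 'width' => 749, 'primary' => true, 'url' => 'http://dab1nmslvvntp.cloudfront.net/wp-content/uploads/2014/07/140624455201.png'],
];

$mainImage = primaryImageUrl($images);
```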
Product API¶
The Product API is used to parse pages representing products. These can be anything from eBay auction pages and books on Amazon, to leashes and collars in “mom and pop’s pet web shop”.
The Product API will attempt to recognize some of the most popular product-related fields in any given product page, including but not limited to:
- price
- discount
- availability status
- stock level
- characteristics / stats (like smartphone capacity, battery life, network type...)
- reviews
- unique identification number like SKU / ISBN / MPN / UPC...
- and much more...
For a more thorough walk through the product API, see the official docs and demo.
The Product API part of the Diffbot PHP client consists of two main classes: the API class, and the Product Entity class. We’ll describe them in order. Note that the API class extends Swader\Diffbot\Abstracts\Api, so be sure to read that first if you haven’t already.
Product API Class¶
class Swader\Diffbot\Api\Product¶
Basic Usage:
use Swader\Diffbot\Diffbot;
$url = 'http://some-product-to-process.com';
$diffbot = new Diffbot('my_token');
$api = $diffbot->createProductApi($url);
setDiscussion¶
Swader\Diffbot\Api\Product::setDiscussion($bool = true)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
Whether or not to use the Discussion API to additionally process any detected comment or review threads on the product page. Behaves as if the Swader\Diffbot\Api\Discussion was set to process the page, and merges the returned data with the Product API’s results by means of a discussion field in the result. The field will have all the sub-fields of the usual Swader\Diffbot\Api\Discussion call; i.e. you will be able to access the Swader\Diffbot\Entity\Discussion entity and all its sub entities via the Swader\Diffbot\Entity\Product::getDiscussion method.
setColors¶
Swader\Diffbot\Api\Product::setColors($bool)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
If set to true, the Product API will try to find out the color options of the product, if available. This feature is experimental and often fails even when color options are obvious.
setAvailability¶
Swader\Diffbot\Api\Product::setAvailability($bool)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
If set to true, Diffbot will attempt to find out whether or not the product in question is available / in stock.
setSize¶
Swader\Diffbot\Api\Product::setSize($bool)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
If set to true, Diffbot will attempt to find out which sizes the product is offered in. Similar to Swader\Diffbot\Api\Product::setColors, this method is unreliable and highly experimental.
Product Entity Class¶
When the Product API is done processing a product (or several), the result will be a Product Entity (i.e. a collection of one Product Entity inside an instance of Swader\Diffbot\Entity\EntityIterator).
For an overview of the abstract class all Entities build on, see Swader\Diffbot\Abstracts\Entity.
Note that the Product entity can also be returned by the Swader\Diffbot\Api\Analyze API in “product” mode, or in default mode when processing a URL that contains a product (auto-determined).
-
class
Swader\Diffbot\Entity\
Product
¶
__construct¶
Swader\Diffbot\Entity\Product::
__construct
(array $data)¶
Parameters:
- $data (array) – The data from which to build the Product entity
The Product entity’s constructor needs the data to populate its properties (see getters below). This class is automatically instantiated after a Swader\Diffbot\Api\Product or Swader\Diffbot\Api\Analyze call. You probably won’t ever need to manually create an instance of this class.
In the case of the Product entity, the constructor differs from the abstract one (Swader\Diffbot\Abstracts\Entity::__construct) in that it also looks for the discussion key in the result, in order to build a Swader\Diffbot\Entity\Discussion sub-entity (see Swader\Diffbot\Entity\Product::getDiscussion).
getType¶
Swader\Diffbot\Entity\Product::
getType
()¶
Returns: string Will always return “product” for products:
// ... API setup ...
$result = $api->call();
echo $result->getType(); // "product"
getText¶
Swader\Diffbot\Entity\Product::
getText
()¶
Returns: string | null Returns the plaintext content of the processed product page. HTML tags are stripped completely, images are removed. If the text property is missing in the result, returns
null
.
getRegularPrice¶
Swader\Diffbot\Entity\Product::
getRegularPrice
()¶
Returns: string Returns regular price as string, e.g. “$23.99” or “32 kn”. If not found, returns offerPrice instead - see
Swader\Diffbot\Entity\Product::getOfferPrice
.
getRegularPriceDetails¶
Swader\Diffbot\Entity\Product::
getRegularPriceDetails
()¶
Returns: array Separates regularPrice into components like currency, amount, and full string. If not found, serves as alias for Swader\Diffbot\Entity\Product::getOfferPriceDetails.
Usage:
// ... API setup ...
$result = $api->call();
var_dump($result->getRegularPriceDetails());
/**
array (size=3)
  'amount' => float 49.85
  'text' => string '£49.85' (length=7)
  'symbol' => string '£' (length=2)
**/
getShippingAmount¶
Swader\Diffbot\Entity\Product::
getShippingAmount
()¶
Returns: string Returns shipping price as string, e.g. “$5.99”.
getSaveAmount¶
Swader\Diffbot\Entity\Product::
getSaveAmount
()¶
Returns: string Returns difference between regular price and offer price, as string, e.g. “$5.99”.
getSaveAmountDetails¶
Swader\Diffbot\Entity\Product::
getSaveAmountDetails
()¶
Returns: array Separates saveAmount into components like currency, amount, and full string, much like Swader\Diffbot\Entity\Product::getRegularPriceDetails. One of the array keys is also a flag indicating whether or not the save amount is a percentage value.
Usage:
// ... API setup ...
$result = $api->call();
var_dump($result->getSaveAmountDetails());
/**
array (size=4)
  'amount' => float 13.5
  'text' => string '£13.50' (length=7)
  'symbol' => string '£' (length=2)
  'percentage' => boolean false
**/
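As a rough illustration of how the returned array might be consumed, the sketch below formats a saving for display. It works on a plain array shaped like the output shown for getSaveAmountDetails; `describeSaving` is a hypothetical helper, not part of the client, and reading `amount` as a percent figure when the `percentage` flag is set is an assumption.

```php
<?php
// Hypothetical helper (not part of the Diffbot client): turns an array shaped
// like the getSaveAmountDetails() output into a short display string.
function describeSaving(array $details): string
{
    if (!empty($details['percentage'])) {
        // Assumption: when the flag is set, 'amount' holds the percent value.
        return $details['amount'] . '% off';
    }
    // Otherwise treat 'amount' as money and prefix the currency symbol.
    return $details['symbol'] . number_format($details['amount'], 2) . ' off';
}

$details = ['amount' => 13.5, 'text' => '£13.50', 'symbol' => '£', 'percentage' => false];
echo describeSaving($details), PHP_EOL; // £13.50 off
```

Branching on the flag first keeps a percentage from being rendered as a currency amount.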
getProductId¶
Swader\Diffbot\Entity\Product::
getProductId
()¶
Returns: string | null Diffbot-determined unique product ID. If upc, isbn, mpn or sku are identified on the page, productId will select from these values in the above order. Null if none found.
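The selection order can be pictured with a small standalone sketch. The real choice is made on Diffbot's side; `pickProductId` and the sample identifiers are hypothetical and only mirror the documented priority (upc, isbn, mpn, sku).

```php
<?php
// Hypothetical helper, not part of the Diffbot client: mirrors the documented
// priority order (upc, isbn, mpn, sku) on a plain array of identifiers.
function pickProductId(array $identifiers): ?string
{
    foreach (['upc', 'isbn', 'mpn', 'sku'] as $key) {
        if (!empty($identifiers[$key])) {
            return (string) $identifiers[$key];
        }
    }
    return null; // none of the identifiers were found
}

// mpn precedes sku in the documented order, so it wins here.
echo pickProductId(['sku' => 'SL-542006', 'mpn' => 'MPN-77']), PHP_EOL; // MPN-77
var_dump(pickProductId([])); // NULL
```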
getSku¶
Swader\Diffbot\Entity\Product::
getSku
()¶
Returns: string | null Returns Stock Keeping Unit – store/vendor inventory number or identifier if available. If not, returns null.
getSpecs¶
Swader\Diffbot\Entity\Product::
getSpecs
()¶
Returns: array If a specifications table or similar data is available on the product page, individual specifications will be returned in the specs object as name/value pairs. Names will be normalized to lowercase with spaces replaced by underscores, e.g. display_resolution.
If no specs table is found, an empty array will be returned.
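To look up a spec by the label printed on the page, the label first has to be brought into the normalized form. A minimal sketch, assuming the normalization is exactly "lowercase, spaces to underscores" as described above; `normalizeSpecName` is a hypothetical helper and the sample specs array is made up.

```php
<?php
// Hypothetical helper: approximates the documented normalization of spec
// names (lowercase, spaces replaced by underscores).
function normalizeSpecName(string $name): string
{
    return str_replace(' ', '_', strtolower(trim($name)));
}

// Made-up data shaped like a getSpecs() result.
$specs = ['display_resolution' => '1920 x 1080'];

$key = normalizeSpecName('Display Resolution');
echo $key, PHP_EOL;                      // display_resolution
echo $specs[$key] ?? 'n/a', PHP_EOL;     // 1920 x 1080
```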
getImages¶
Swader\Diffbot\Entity\Product::
getImages
()¶
Returns: array An array of images found on the product page, with their details. The elements of the array are arrays like this one:
/**
array (size=7)
  'height' => int 512
  'diffbotUri' => string 'image|3|-851701004' (length=18)
  'naturalHeight' => int 727
  'width' => int 749
  'primary' => boolean true
  'naturalWidth' => int 1063
  'url' => string 'http://dab1nmslvvntp.cloudfront.net/wp-content/uploads/2014/07/140624455201.png' (length=79)
**/
Unlike the Swader\Diffbot\Api\Discussion API which returns details about discussion posts even when used with the Swader\Diffbot\Api\Product API, the image data returned with this method is minimal. For fuller details about images, use the Swader\Diffbot\Api\Image API.
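A common use of this array is picking the main product image. The sketch below assumes elements shaped like the sample above; `findPrimaryImage` is a hypothetical helper, and falling back to the first image when none is flagged as primary is a design choice of this example, not documented behavior.

```php
<?php
// Hypothetical helper: returns the element flagged 'primary' from an array
// shaped like the getImages() output, or the first image as a fallback.
function findPrimaryImage(array $images): ?array
{
    foreach ($images as $image) {
        if (!empty($image['primary'])) {
            return $image;
        }
    }
    return $images[0] ?? null; // null when the list is empty
}

// Made-up sample data using placeholder URLs.
$images = [
    ['url' => 'http://example.com/thumb.png', 'primary' => false],
    ['url' => 'http://example.com/main.png', 'primary' => true],
];
echo findPrimaryImage($images)['url'], PHP_EOL; // http://example.com/main.png
```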
getPrefixCode¶
Swader\Diffbot\Entity\Product::
getPrefixCode
()¶
Returns: string | null Country of origin as identified by UPC/ISBN, e.g. “United Kingdom”. Null if not present.
getProductOrigin¶
Swader\Diffbot\Entity\Product::
getProductOrigin
()¶
Returns: string | null If available, two-character ISO country code where the product was produced (e.g. “gb”). Null if not present.
getPriceRange¶
Swader\Diffbot\Entity\Product::
getPriceRange
()¶
Returns: array | null If the product is available in a range of prices, the minimum and maximum values will be returned. The lowest price will also be returned as the offerPrice (see
Swader\Diffbot\Entity\Product::getOfferPrice
). If no range is detected, returns null.
getQuantityPrices¶
Swader\Diffbot\Entity\Product::
getQuantityPrices
()¶
Returns: array | null If the product is available with quantity-based discounts, all identifiable price points will be returned. The lowest price will also be returned as the offerPrice (see Swader\Diffbot\Entity\Product::getOfferPrice). If no quantity-based prices are detected, returns null.
isAvailable¶
Swader\Diffbot\Entity\Product::
isAvailable
()¶
Returns: bool | null Tries to determine whether or not the product is available / in stock. Returns boolean if determined, or null if not.
getOfferPrice¶
Swader\Diffbot\Entity\Product::
getOfferPrice
()¶
Returns: string Returns price as string, e.g. “£49.85” or “32 kn”.
getOfferPriceDetails¶
Swader\Diffbot\Entity\Product::
getOfferPriceDetails
()¶
Returns: array Separates offerPrice into components like currency, amount, and full string.
Usage:
// ... API setup ... // $result = $api->call(); var_dump($result->getOfferPriceDetails()); /** array (size=3) 'amount' => float 49.85 'text' => string '£49.85' (length=7) 'symbol' => string '£' (length=2) **/
getSize¶
Swader\Diffbot\Entity\Product::
getSize
()¶
Returns: array | null If product is available in different sizes, returns array of those sizes. Highly experimental and often unreliable. This field is optional, and needs to be set on the API. See
Swader\Diffbot\Api\Product::setSize
.
getColors¶
Swader\Diffbot\Entity\Product::
getColors
()¶
Returns: array | null If the product is available in multiple colors, returns the color options. Highly experimental and often unreliable. This field is optional, and needs to be set on the API. See
Swader\Diffbot\Api\Product::setColors
.
getBrand¶
Swader\Diffbot\Entity\Product::
getBrand
()¶
Returns: string The brand of the product, as determined by Diffbot.
getDiscussion¶
Swader\Diffbot\Entity\Product::
getDiscussion
()¶
Returns: Swader\Diffbot\Entity\Discussion | null
Returns the Swader\Diffbot\Entity\Discussion found on the product’s page (review section). See Swader\Diffbot\Api\Product::setDiscussion for details and below for usage:
use Swader\Diffbot\Diffbot;
$url = "http://www.sportsdirect.com/slazenger-plain-polo-shirt-mens-542006?colcode=54200601";
$diffbot = new Diffbot("my_token");
$api = $diffbot->createProductApi($url);
$result = $api->call();
echo $result->getDiscussion()->getNumPosts(); // 10
echo $result->getDiscussion()->getParticipants(); // 10
For other methods exposed on the Swader\Diffbot\Entity\Discussion entity, see its documentation.
Discussion API¶
This API is used to turn content like product reviews, comments on posts and forum threads into JSON. This API can be unleashed onto a forum / comment thread directly, or onto a product page / article page containing comments / reviews.
The Discussion API part of the Diffbot PHP client consists of three main classes: the API class, the Discussion Entity class, and the Post Entity class. We’ll describe them in order. Note that the API class extends Swader\Diffbot\Abstracts\Api
, so be sure to read that first if you haven’t already.
Discussion API Class¶
-
class
Swader\Diffbot\Api\
Discussion
¶
Basic Usage:
use Swader\Diffbot\Diffbot;
$url = 'http://some-article-to-process.com';
$diffbot = new Diffbot('my_token');
$api = $diffbot->createDiscussionApi($url);
setMaxPages¶
Swader\Diffbot\Api\Discussion::
setMaxPages
($max = 1)¶
Parameters:
- $max (int) – max number of pages to fetch
Returns: $this
Set the maximum number of pages in a thread to automatically concatenate in a single response. Default = 1 (no concatenation). Set maxPages=all to retrieve all pages of a thread regardless of length. Each individual page will count as a separate API call.
setSentiment¶
Swader\Diffbot\Api\Discussion::
setSentiment
($bool)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
This method sets the sentiment optional field value. This determines whether or not to return the sentiment score of the analyzed posts (each individual post gets one), a value ranging from -1.0 (very negative) to 1.0 (very positive). Sentiment analysis is powered by Semantria for advanced features like keyword and entity extraction, but the basic sentiment analysis (score only) is enabled for everyone, even those without Semantria accounts.
Usage:
$url = 'https://www.reddit.com/r/PHP/comments/3nl7g1/authentication_flow_in_a_microservice_architecture/';
// ...
$api->setSentiment(true);
$result = $api->call();
// ...
echo $result->getPosts()[0]->getSentiment(); // -0.0789
Discussion Entity Class¶
When the Discussion API is done processing a URL, the result will be a Discussion Entity (i.e. a collection of one Discussion Entity inside an instance of Swader\Diffbot\Entity\EntityIterator).
For an overview of the abstract class all Entities build on, see Swader\Diffbot\Abstracts\Entity
.
-
class
Swader\Diffbot\Entity\
Discussion
¶
__construct¶
Swader\Diffbot\Entity\Discussion::
__construct
(array $data)¶
Parameters:
- $data (array) – The data from which to build the Discussion object
The Discussion entity’s constructor needs the data to populate its properties (see getters below). This class is automatically instantiated after a Swader\Diffbot\Api\Discussion call. You probably won’t ever need to manually create an instance of this class.
Like Swader\Diffbot\Entity\Product and Swader\Diffbot\Entity\Article, the Discussion entity also has its own custom constructor, looking for the posts key inside of the return data, in order to create some nested Swader\Diffbot\Entity\Post objects.
getType¶
Swader\Diffbot\Entity\Discussion::
getType
()¶
Returns: string Will always return “discussion” for discussions:
// ... API setup ...
$result = $api->call();
echo $result->getType(); // "discussion"
getNumPosts¶
Swader\Diffbot\Entity\Discussion::
getNumPosts
()¶
Returns: int Returns the number of posts found in the discussion. Only returns the number of posts in the fetched page range, so even if there are 100 posts over 20 pages, this method will return 5 if
Swader\Diffbot\Api\Discussion::setMaxPages
is still set to 1.
getTags¶
Swader\Diffbot\Entity\Discussion::
getTags
()¶
Returns: array Returns an array of tags/entities, generated from analysis of the extracted text and cross-referenced with DBpedia and other data sources. Note that these are not the meta tags as defined in the page’s <head>, but machine learned ones:
// ... API setup ...
$url = 'https://www.reddit.com/r/PHP/comments/3nl7g1/authentication_flow_in_a_microservice_architecture/';
// ...
$result = $api->call();
echo count($result->tags); // 5
var_dump($result->getTags());
/** Output:
array (size=5)
  0 =>
    array (size=5)
      'count' => int 5
      'prevalence' => float 0.11
      'score' => float 0.11
      'label' => string 'User (computing)' (length=16)
      'uri' => string 'http://dbpedia.org/resource/User_(computing)' (length=44)
  1 =>
    array (size=5)
      'count' => int 4
      'prevalence' => float 0.09
      'score' => float 0.09
      'label' => string 'Hypertext Transfer Protocol' (length=27)
      'uri' => string 'http://dbpedia.org/resource/Hypertext_Transfer_Protocol' (length=55)
  2 =>
    array (size=5)
      'count' => int 3
      'prevalence' => float 0.07
      'score' => float 0.07
      'label' => string 'POST (HTTP)' (length=11)
      'uri' => string 'http://dbpedia.org/resource/POST_(HTTP)' (length=39)
  3 =>
    array (size=5)
      'count' => int 2
      'prevalence' => float 0.04
      'score' => float 0.04
      'label' => string 'Object (computer science)' (length=25)
      'uri' => string 'http://dbpedia.org/resource/Object_(computer_science)' (length=53)
  4 =>
    array (size=5)
      'count' => int 2
      'prevalence' => float 0.04
      'score' => float 0.04
      'label' => string 'Coupling' (length=8)
      'uri' => string 'http://dbpedia.org/resource/Coupling' (length=36)
**/
Returns a maximum of 5.
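Since each tag carries a score, the array is easy to rank. The following sketch sorts a plain array shaped like the getTags output and keeps the highest-scoring labels; `topTagLabels` is a hypothetical helper, and the sample data reuses three labels and scores from the output shown for this method.

```php
<?php
// Hypothetical helper: ranks tags (shaped like the getTags() output) by
// 'score', highest first, and returns the labels of the top $limit entries.
function topTagLabels(array $tags, int $limit = 3): array
{
    usort($tags, function (array $a, array $b) {
        return $b['score'] <=> $a['score']; // descending by score
    });
    return array_column(array_slice($tags, 0, $limit), 'label');
}

$tags = [
    ['label' => 'POST (HTTP)', 'score' => 0.07],
    ['label' => 'User (computing)', 'score' => 0.11],
    ['label' => 'Hypertext Transfer Protocol', 'score' => 0.09],
];
echo implode(', ', topTagLabels($tags, 2)), PHP_EOL;
// User (computing), Hypertext Transfer Protocol
```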
getParticipants¶
Swader\Diffbot\Entity\Discussion::
getParticipants
()¶
Returns: int The number of unique participants in the discussion.
getNumPages¶
Swader\Diffbot\Entity\Discussion::
getNumPages
()¶
Returns: int Returns the number of pages if the discussion is a multi-page one. Read about auto-concatenation here and study the
Swader\Diffbot\Api\Discussion::setMaxPages
method for more details.
getNextPages¶
Swader\Diffbot\Entity\Discussion::
getNextPages
()¶
Returns: array If the discussion is a multi-page one, returns the list of absolute URLs of the pages that follow after the one that was processed. If the discussion is a single-page one, an empty array is returned.
getNextPage¶
Swader\Diffbot\Entity\Discussion::
getNextPage
()¶
Returns: string | null If the discussion is a multi-page one, returns the absolute subsequent page URL.
getProvider¶
Swader\Diffbot\Entity\Discussion::
getProvider
()¶
Returns: string | null Returns the provider of the comment / review system. This will be something like “disqus”, “facebook”, etc. In cases of forums and similar all-encompassing systems like Reddit, this method will return null.
getRssUrl¶
Swader\Diffbot\Entity\Discussion::
getRssUrl
()¶
Returns: string | null Returns the RSS feed URL for the discussion, if available.
getConfidence¶
Swader\Diffbot\Entity\Discussion::
getConfidence
()¶
Returns: float | null A number from -1 to 1. Not sure what it does. Waiting for feedback from HQ. @todo find out what this is.
getPosts¶
Swader\Diffbot\Entity\Discussion::
getPosts
()¶
Returns: array Returns an array of
Swader\Diffbot\Entity\Post
objects, each built around the data in every individual post of a discussion. For post accessor methods, see below.
Discussion Post Class¶
-
class
Swader\Diffbot\Entity\
Post
¶
Every Discussion entity has children - its posts. Every Post is its own entity, and very similar to Swader\Diffbot\Entity\Article
, sharing many of its methods.
getLang¶
Swader\Diffbot\Entity\Post::
getLang
()¶
Returns: string Returns the language code of the detected language of the processed content. The code returned is a two-character ISO 639-1 code: http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
getHumanLanguage¶
Swader\Diffbot\Entity\Post::
getHumanLanguage
()¶
Returns: string Alias method for
getLang()
above.
getText¶
Swader\Diffbot\Entity\Post::
getText
()¶
Returns: string | null Returns the plaintext content of the processed post. HTML tags are stripped completely, images are removed. If the text property is missing in the result, returns
null
.
getHtml¶
Swader\Diffbot\Entity\Post::
getHtml
()¶
Returns: string | null Returns the full HTML content of the post. If the HTML property is missing in the result, returns null.
getDate¶
getAuthor¶
Swader\Diffbot\Entity\Post::
getAuthor
()¶
Returns: string | null Returns the name of the author as written on the page. If Diffbot was unable to figure out who the author is,
null
is returned.
getAuthorUrl¶
Swader\Diffbot\Entity\Post::
getAuthorUrl
()¶
Returns: string | null If the author’s profile URL could be determined, this method will return it.
getTags¶
Swader\Diffbot\Entity\Post::
getTags
()¶
Returns: array Returns an array of tags/entities, generated from analysis of the extracted text and cross-referenced with DBpedia and other data sources. Note that these are not the meta tags as defined by the author, but machine learned ones. Same thing as
Swader\Diffbot\Entity\Article::getTags
andSwader\Diffbot\Entity\Discussion::getTags
.
getSentiment¶
Swader\Diffbot\Entity\Post::
getSentiment
()¶
Returns: float | null Returns the sentiment score of the analyzed post text, a value ranging from -1.0 (very negative) to 1.0 (very positive). If the sentiment score is absent (due to Diffbot being unable to determine it, or due to Swader\Diffbot\Api\Discussion::setSentiment being set to false), returns null.
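Because individual posts may return null, any aggregate over a thread has to skip the missing scores. A minimal sketch on a plain list of scores (as they might be collected from each post's getSentiment call); `averageSentiment` is a hypothetical helper, not part of the client.

```php
<?php
// Hypothetical helper: averages sentiment scores while ignoring posts for
// which no score was returned (null).
function averageSentiment(array $scores): ?float
{
    $known = array_filter($scores, function ($score) {
        return $score !== null;
    });
    if (count($known) === 0) {
        return null; // nothing to average
    }
    return array_sum($known) / count($known);
}

var_dump(averageSentiment([0.5, null, -0.25])); // float(0.125)
var_dump(averageSentiment([null]));             // NULL
```

Filtering on `!== null` rather than on truthiness matters here, since a neutral score of 0.0 is a valid value.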
getVotes¶
Swader\Diffbot\Entity\Post::
getVotes
()¶
Returns: int If a voting system exists and is easily discernible, Diffbot returns the number of upvotes on the post.
getId¶
Swader\Diffbot\Entity\Post::
getId
()¶
Returns: int Returns the ID of the post (usually the ordinal number of the post in the list of all posts, starting with 0 for the first one).
getParentId¶
Swader\Diffbot\Entity\Post::
getParentId
()¶
Returns: int | null If the post is a reply, this is the ID of the post it replies to. If not, null.
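The ID and parent ID together are enough to rebuild the reply structure of a thread. The sketch below does this on plain arrays standing in for Post entities; the `id` and `parentId` keys and the helper `groupRepliesByParent` are assumptions made for the example.

```php
<?php
// Hypothetical helper: maps each parent post ID to the IDs of its direct
// replies. Posts are plain arrays standing in for Post entities.
function groupRepliesByParent(array $posts): array
{
    $children = [];
    foreach ($posts as $post) {
        if ($post['parentId'] !== null) {
            $children[$post['parentId']][] = $post['id'];
        }
    }
    return $children;
}

$posts = [
    ['id' => 0, 'parentId' => null], // top-level post
    ['id' => 1, 'parentId' => 0],
    ['id' => 2, 'parentId' => 0],
    ['id' => 3, 'parentId' => 1],
];
print_r(groupRepliesByParent($posts)); // post 0 has replies 1 and 2, post 1 has reply 3
```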
getImages¶
Swader\Diffbot\Entity\Post::
getImages
()¶
Returns: array An array of images found in the post, with their details. The elements of the array are arrays like this one:
/**
array (size=7)
  'height' => int 512
  'diffbotUri' => string 'image|3|-851701004' (length=18)
  'naturalHeight' => int 727
  'width' => int 749
  'primary' => boolean true
  'naturalWidth' => int 1063
  'url' => string 'http://dab1nmslvvntp.cloudfront.net/wp-content/uploads/2014/07/140624455201.png' (length=79)
**/
The image data returned with this method is minimal. For fuller details about images, use the Swader\Diffbot\Api\Image API.
Image API¶
This API is used to turn content like image galleries, Instagram posts, or image-rich articles into JSON.
For examples of data that might be returned, please see http://diffbot.com and run the Image API demo.
The Image API part of the Diffbot PHP client consists of two main classes: the API class, and the Image Entity class. We’ll describe them in order. Note that the API class extends Swader\Diffbot\Abstracts\Api
, so be sure to read that first if you haven’t already.
Image API Class¶
-
class
Swader\Diffbot\Api\
Image
¶
Basic Usage:
use Swader\Diffbot\Diffbot;
$url = 'http://some-article-to-process.com';
$diffbot = new Diffbot('my_token');
$api = $diffbot->createImageApi($url);
setMentions¶
Swader\Diffbot\Api\Image::
setMentions
($bool)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
If set to true, the Image API will attempt to identify other locations online where the image was used - similar to Google Image reverse search.
Image Entity Class¶
When the Image API is done processing a URL, the result will be an instance of Swader\Diffbot\Entity\EntityIterator
containing one Image Entity instance for every image found.
For an overview of the abstract class all Entities (including Image) build on, see Swader\Diffbot\Abstracts\Entity
.
Note that the Image entities can also be returned by the Swader\Diffbot\Api\Analyze
API in “image” mode, or in default mode when processing a URL that is essentially an image.
-
class
Swader\Diffbot\Entity\
Image
¶
getType¶
Swader\Diffbot\Entity\Image::
getType
()¶
Returns: string Will always return “image” for images:
// ... API setup ... // $result = $api->call(); echo $result->getType(); // "image"
getHeight¶
Swader\Diffbot\Entity\Image::
getHeight
()¶
Returns: int Height of image if resized by browser via CSS / JS. If not resized, serves as alias for
Swader\Diffbot\Entity\Image::getNaturalHeight
.
getWidth¶
Swader\Diffbot\Entity\Image::
getWidth
()¶
Returns: int Width of image if resized by browser via CSS / JS. If not resized, serves as alias for
Swader\Diffbot\Entity\Image::getNaturalWidth
.
getNaturalHeight¶
Swader\Diffbot\Entity\Image::
getNaturalHeight
()¶
Returns: int Raw image height, in pixels.
getNaturalWidth¶
Swader\Diffbot\Entity\Image::
getNaturalWidth
()¶
Returns: int Raw image width, in pixels.
getAnchorUrl¶
Swader\Diffbot\Entity\Image::
getAnchorUrl
()¶
Returns: string | null URL the image links to, if any. Null if image isn’t linked.
getXPath¶
Swader\Diffbot\Entity\Image::
getXPath
()¶
Returns: string The XPath expression of the position of the image node in the DOM.
getMentions¶
Swader\Diffbot\Entity\Image::
getMentions
()¶
Returns: array Returns an array of [title => “title”, link => “link”] arrays for all posts where this image, or a similar one, was found. If not found, returns empty array.
getFaces¶
Swader\Diffbot\Entity\Image::
getFaces
()¶
Returns: array | string Attempts to locate human faces in the image and returns an array of arrays, each holding the x, y, height and width of one detected face. In most cases, does not work at all and is in heavy alpha mode. Do not rely on this method for anything. Returns empty string if nothing found.
Analyze API¶
This API is a sort of “catch all” for all other API types in that it automatically determines the type of content being processed, and applies the appropriate API call to it.
This API will return entities matching the determined content type. For example, if you run Analyze API on a URL like www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/
, the content type will be determined as “article” and it’ll be exactly as if you had called the Article API (Swader\Diffbot\Api\Article
) on it.
Analyze API Class¶
-
class
Swader\Diffbot\Api\
Analyze
¶
setDiscussion¶
Swader\Diffbot\Api\Analyze::
setDiscussion
($bool = true)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
If set to false, will not extract article comments in a Discussion entity embedded in the Article / Product entity. By default, it will.
setMode¶
Swader\Diffbot\Api\Analyze::
setMode
($mode)¶
Parameters:
- $mode (string) – “article”, “product”, “image” or “auto”
Returns: $this
By default the Analyze API will fully extract all pages that match an existing Automatic API – articles, products or image pages. Set mode to a specific page-type (e.g., mode=article) to extract content only from that specific page-type. All other pages will simply return the default Analyze fields.
Usage with defaults:
use Swader\Diffbot\Diffbot;
$url = "www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/";
$diffbot = new Diffbot("my_token");
$api = $diffbot->createAnalyzeApi($url);
$result = $api->call();
echo $result->getAuthorUrl(); // "http://www.sitepoint.com/author/bskvorc/"
echo $result->getDiscussion()->getNumPosts(); // 7
echo $result->getDiscussion()->getProvider(); // Disqus
Usage with discussion off:
use Swader\Diffbot\Diffbot;
$url = "www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/";
$diffbot = new Diffbot("my_token");
$api = $diffbot->createAnalyzeApi($url);
$api->setDiscussion(false);
$result = $api->call();
echo $result->getAuthorUrl(); // "http://www.sitepoint.com/author/bskvorc/"
var_dump($result->getDiscussion()); // null
Usage with non-matching mode:
use Swader\Diffbot\Diffbot;
$url = "www.sitepoint.com/quick-tip-get-homestead-vagrant-vm-running/";
$diffbot = new Diffbot("my_token");
$api = $diffbot->createAnalyzeApi($url);
$api->setMode("image");
$result = $api->call();
echo $result->getAuthorUrl(); // null
var_dump($result->getDiscussion()); // null
In the last example above, no data is available due to a mismatch in mode - using image parsing on an article entity does not produce any useful information.
Custom API¶
The Custom API is user defined in the Diffbot UI.
For a tutorial on creating a Custom API in the Diffbot UI, see here.
Custom API Class¶
-
class
Swader\Diffbot\Api\
Custom
¶ When you have a Custom API ready on Diffbot’s end, you instantiate the Custom API class and pass in the Custom API name, along with the URL to process. Everything from that point on is identical to the other APIs, except the fact that instead of specific entities being returned, all Custom API calls return an iterator of
Swader\Diffbot\Entity\Wildcard
entities.
__construct¶
Swader\Diffbot\Api\Custom::
__construct
($url, $name)¶
Parameters:
- $url (string) – The URL to process
- $name (string) – The name of the API
The construct method is identical to the one in Swader\Diffbot\Abstracts\Api with one difference - it also needs the name of the Custom API in question, so that it can build the API URL to which the call will be dispatched when Swader\Diffbot\Abstracts\Api::call is called:
<?php
require_once '../vendor/autoload.php';
use Swader\Diffbot\Diffbot;
$diffbot = new Diffbot($my_token);
$url = 'http://sitepoint.com/author/bskvorc';
$api = $diffbot->createCustomApi($url, "AuthorFolio");
$result = $api->call();
echo $result->getBio(); // "Bruno is a coder from Croatia..."
In the example above, AuthorFolio is a custom API from this tutorial, which processes a SitePoint author’s portfolio. The getBio call works because of the magic methods in Swader\Diffbot\Abstracts\Entity which Swader\Diffbot\Entity\Wildcard inherits.
Wildcard Entity Class¶
-
class
Swader\Diffbot\Entity\
Wildcard
¶
The Wildcard entity is returned when the type of a processed post does not match a type defined in the currently set EntityFactory (see Swader\Diffbot\Factory\Entity and Swader\Diffbot\Diffbot::setEntityFactory).
It is nothing more than a concretization of Swader\Diffbot\Abstracts\Entity and as such contains no additional methods.
In the example above, the getBio method is called on a Wildcard instance, returned by the call to the AuthorFolio custom API.
Crawl API¶
Diffbot has the ability to crawl entire domains and process all crawled pages. For a difference between crawling and processing see here.
To programmatically create or update crawljobs, use this API.
A full tutorial on using this API can be found here, and a working app powered by it at http://search.sitepoint.tools.
The Crawl API is also known as the Crawlbot.
Crawl API Class¶
-
class
Swader\Diffbot\Api\
Crawl
¶
The Crawl API is used to create new crawljobs or modify existing ones. The Crawl API is atypical, and as such does not extend Swader\Diffbot\Abstracts\Api
unlike the more entity-specific APIs.
Note that everything you can do with the Crawl API can also be done in the Diffbot UI.
__construct¶
Swader\Diffbot\Api\Crawl::
__construct
($name = null, $api = null)¶
Parameters:
- $name (string) – [Optional] The name of crawljob to be created or modified.
- $api (Swader\Diffbot\Interfaces\Api) – [Optional] The API to use while processing the crawled links.
The $name argument is optional. If omitted, the second argument is ignored and Swader\Diffbot\Api\Crawl::call will return a list of all crawljobs on a given Diffbot token, with their information, in a Swader\Diffbot\Entity\EntityIterator collection of Swader\Diffbot\Entity\JobCrawl instances.
The $api argument is also optional, but must be an instance of Swader\Diffbot\Interfaces\Api if provided:
<?php
// ... set up Diffbot
$api = $diffbot->createArticleApi('crawl');
$crawljob = $diffbot->crawl('myCrawlJob', $api);
// ... crawljob setup
// $crawljob->setSeeds( ... )
$crawljob->call();
getName¶
Swader\Diffbot\Api\Crawl::
getName
()¶
Returns: string Returns the unique name of the crawljob. This name is later used to download datasets, or to modify the job.
setApi¶
Swader\Diffbot\Api\Crawl::
setApi
($api)¶
Parameters:
- $api (Swader\Diffbot\Interfaces\Api) – An instance of Swader\Diffbot\Interfaces\Api to process all crawled links.
Returns: $this
The API cannot be modified after a crawljob has been created. This method is useless on existing crawljobs (see https://www.diffbot.com/dev/docs/crawl/api.jsp)
The $api passed into this class will be used on Diffbot’s end to process all the pages the crawljob provides. For example, if you set http://sitepoint.com as the seed URL (see Swader\Diffbot\Api\Crawl::setSeeds), and an instance of the Swader\Diffbot\Api\Article API as the $api argument, all pages found on http://sitepoint.com will be processed with the Article API. The results won’t be returned - rather, they’ll be saved on Diffbot’s servers for searching later (see Swader\Diffbot\Api\Search).
The other APIs require a URL parameter in their constructor, but when crawling, it is Crawlbot that provides the URLs. To get around this requirement, use the string “crawl” instead of a URL when instantiating a new API for use with the Crawl API:
// ...
$api = $diffbot->createArticleApi('crawl');
// ...
setSeeds¶
Swader\Diffbot\Api\Crawl::
setSeeds
(array $seeds)¶
Parameters:
- $seeds (array) – An array of URLs (seeds) which to crawl for matching links
Returns: $this
By default Crawlbot will restrict spidering to the entire domain (“http://blog.diffbot.com” will include URLs at “http://www.diffbot.com”):
// ...
$crawljob->setSeeds(['http://sitepoint.com', 'http://blog.diffbot.com']);
// ...
setUrlCrawlPatterns¶
Swader\Diffbot\Api\Crawl::
setUrlCrawlPatterns
(array $pattern = null)¶
Parameters:
- $pattern (array) – [Optional] Array of strings to limit pages crawled to those whose URLs contain any of the content strings.
Returns: $this
You can use the exclamation point to specify a negative string, e.g. !product to exclude URLs containing the string “product,” and the ^ and $ characters to limit matches to the beginning or end of the URL.
The use of a urlCrawlPattern will allow Crawlbot to spider outside of the seed domain(s); it will follow all matching URLs regardless of domain:
// ...
$crawljob->setUrlCrawlPatterns(['!author', '!page']);
// ...
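To get a feel for which URLs a pattern list would let through before starting a crawl, the rules can be approximated locally. This is only a sketch: the matching itself happens on Diffbot's servers, and `matchesPattern` and `shouldCrawl` are hypothetical helpers. How positive and negative patterns combine is an assumption here (any matching negative pattern excludes the URL; if positive patterns exist, at least one must match).

```php
<?php
// Hypothetical helper: checks one pattern against a URL, honoring a leading
// "^" (match at start) and a trailing "$" (match at end).
function matchesPattern(string $url, string $pattern): bool
{
    $atStart = ($pattern !== '' && $pattern[0] === '^');
    if ($atStart) {
        $pattern = (string) substr($pattern, 1);
    }
    $atEnd = ($pattern !== '' && substr($pattern, -1) === '$');
    if ($atEnd) {
        $pattern = (string) substr($pattern, 0, -1);
    }
    if ($pattern === '') {
        return true; // nothing left to match against
    }
    if ($atStart && $atEnd) {
        return $url === $pattern;
    }
    if ($atStart) {
        return strpos($url, $pattern) === 0;
    }
    if ($atEnd) {
        return substr($url, -strlen($pattern)) === $pattern;
    }
    return strpos($url, $pattern) !== false;
}

// Hypothetical helper: applies a whole pattern list to a URL.
// Assumption: a matching "!" pattern always excludes the URL.
function shouldCrawl(string $url, array $patterns): bool
{
    $hasPositive = false;
    $matchedPositive = false;
    foreach ($patterns as $pattern) {
        if ($pattern !== '' && $pattern[0] === '!') {
            if (matchesPattern($url, (string) substr($pattern, 1))) {
                return false;
            }
            continue;
        }
        $hasPositive = true;
        if (matchesPattern($url, $pattern)) {
            $matchedPositive = true;
        }
    }
    return !$hasPositive || $matchedPositive;
}

$patterns = ['!author', '!page'];
var_dump(shouldCrawl('http://sitepoint.com/author/bskvorc', $patterns)); // bool(false)
var_dump(shouldCrawl('http://sitepoint.com/php/', $patterns));           // bool(true)
```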
setUrlCrawlRegex¶
Swader\Diffbot\Api\Crawl::
setUrlCrawlRegex
($regex)¶
Parameters:
- $regex (string) – a regular expression string
Returns: $this
Specify a regular expression to limit pages crawled to those URLs that match your expression. This will override any urlCrawlPattern value.
The use of a urlCrawlRegEx will allow Crawlbot to spider outside of the seed domain; it will follow all matching URLs regardless of domain.
setUrlProcessPatterns¶
Swader\Diffbot\Api\Crawl::
setUrlProcessPatterns
(array $pattern = null)¶
Parameters:
- $pattern (array) – [Optional] array of strings to search for in URLs
Returns: $this
Only URLs containing one or more of the strings specified will be processed by Diffbot. You can use the exclamation point to specify a negative string, e.g. !/category to exclude URLs containing the string “/category,” and the ^ and $ characters to limit matches to the beginning or end of the URL.
setUrlProcessRegex¶
Swader\Diffbot\Api\Crawl::
setUrlProcessRegex
($regex)¶
Parameters:
- $regex (string) – Regular expression string
Returns: $this
Specify a regular expression to limit pages processed to those URLs that match your expression. This will override any urlProcessPattern value.
setPageProcessPatterns¶
Swader\Diffbot\Api\Crawl::
setPageProcessPatterns
(array $pattern = null)¶
Parameters:
- $pattern (array) – [Optional] Array of strings
Returns: $this
Specify strings to look for in the HTML of the pages of the crawled URLs. Only pages containing one or more of those strings will be processed by the designated API. Very useful for limiting processing to pages with a certain class present (e.g.
class=article
) to further narrow down processing scope and reduce expenses (fewer API calls).
setMaxHops¶
Swader\Diffbot\Api\Crawl::
setMaxHops
($input = -1)¶
Parameters:
- $input (int) – [Optional] Maximum number of hops
Returns: $this
Specify the depth of your crawl. A maxHops=0 will limit processing to the seed URL(s) only – no other links will be processed; maxHops=1 will process all (otherwise matching) pages whose links appear on seed URL(s); maxHops=2 will process pages whose links appear on those pages; and so on. By default, Crawlbot will crawl and process links at any depth.
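The effect of maxHops can be pictured as a breadth-first walk with a depth limit. The sketch below runs on a tiny in-memory link graph rather than a real site; `pagesWithinHops` and the graph are hypothetical, and Crawlbot's actual traversal may differ in detail.

```php
<?php
// Hypothetical helper: breadth-first walk over an in-memory link graph,
// returning every page reachable from the seeds within $maxHops hops.
// A negative $maxHops means "no depth limit", mirroring the default of -1.
function pagesWithinHops(array $links, array $seeds, int $maxHops = -1): array
{
    $depth = [];
    $queue = [];
    foreach ($seeds as $seed) {
        $depth[$seed] = 0;
        $queue[] = $seed;
    }
    while ($queue) {
        $page = array_shift($queue);
        if ($maxHops >= 0 && $depth[$page] >= $maxHops) {
            continue; // do not follow links beyond the allowed depth
        }
        foreach ($links[$page] ?? [] as $target) {
            if (!isset($depth[$target])) {
                $depth[$target] = $depth[$page] + 1;
                $queue[] = $target;
            }
        }
    }
    return array_keys($depth);
}

// Made-up site structure: page => pages it links to.
$links = [
    '/'    => ['/a', '/b'],
    '/a'   => ['/a/1'],
    '/a/1' => ['/deep'],
];
echo implode(', ', pagesWithinHops($links, ['/'], 0)), PHP_EOL; // /
echo implode(', ', pagesWithinHops($links, ['/'], 1)), PHP_EOL; // /, /a, /b
echo implode(', ', pagesWithinHops($links, ['/'])), PHP_EOL;    // /, /a, /b, /a/1, /deep
```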
setMaxToCrawl¶
Swader\Diffbot\Api\Crawl::
setMaxToCrawl
($input = 100000)¶
Parameters:
- $input (int) – [Optional] Maximum number of URLs to spider
Returns: $this
Note that spidering (crawling) does not affect the API quota, so reducing this value only shortens the crawljob (it finishes sooner because the limit is reached sooner). For a difference between crawling and processing see here.
setMaxToProcess¶
notify¶
Swader\Diffbot\Api\Crawl::
notify
($string)¶
Parameters:
- $string (string) – Email or URL
Returns: $this
Throws:
InvalidArgumentException
if the input parameter is neither a valid email address nor a valid URL
If the input is an email address, a message is sent to that address when the crawl hits the maxToCrawl or maxToProcess limit, or when the crawl completes.
If the input is a URL, you will receive a POST with X-Crawl-Name and X-Crawl-Status in the headers, and the full JSON response in the POST body.
This method can be called once with an email and another time with a URL in order to define both an email notification hook and a URL notification hook. An InvalidArgumentException will be thrown if the argument isn’t a valid string (neither a URL nor an email address).
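The email-or-URL distinction can be sketched with PHP's built-in filter_var validators. The function name below is hypothetical, and this is only an approximation of the kind of check described above, not the client's actual validation code.

```php
<?php
// Hypothetical helper; the client's own validation may differ.
function classifyNotifyTarget(string $target): string
{
    if (filter_var($target, FILTER_VALIDATE_EMAIL) !== false) {
        return 'email';
    }
    if (filter_var($target, FILTER_VALIDATE_URL) !== false) {
        return 'url';
    }
    throw new InvalidArgumentException(
        'Notification target must be a valid email address or URL.'
    );
}

echo classifyNotifyTarget('me@example.com'), "\n";
echo classifyNotifyTarget('http://example.com/hook'), "\n";
```

Running a value through a check like this before passing it to notify() lets an application report bad input in its own terms instead of catching the exception.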
setCrawlDelay¶
Swader\Diffbot\Api\Crawl::
setCrawlDelay
($input = 0.25)¶
Parameters:
- $input (float) – [Optional] Delay between consecutive URL fetches, in floating point seconds. Defaults to 0.25 seconds.
Returns: $this
Throws:
InvalidArgumentException
if the input parameter is not a number
Wait this many seconds between each URL crawled from a single IP address. Specify the number of seconds as an integer or floating-point number.
setRepeat¶
Swader\Diffbot\Api\Crawl::
setRepeat
($input)¶
Parameters:
- $input (float) – The wait period between crawljob restarts, expressed in floating point days. E.g. 0.5 is 12 hours, 7 is a week, 14.5 is 2 weeks and 12 hours, etc. By default, crawls will not be repeated.
Returns: $this
Throws:
InvalidArgumentException
if the input parameter is not a number
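Because the repeat interval is expressed in days, an interval that is more naturally thought of in hours has to be converted first. A minimal sketch, using a hypothetical helper name that is not part of the client:

```php
<?php
// Hypothetical helper: converts hours into the fractional days setRepeat expects.
function hoursToRepeatDays(float $hours): float
{
    if ($hours <= 0) {
        throw new InvalidArgumentException('Interval must be positive.');
    }
    return $hours / 24;
}

echo hoursToRepeatDays(12), "\n";  // half a day
echo hoursToRepeatDays(348), "\n"; // two weeks and 12 hours

// With a configured Crawl API instance in $crawl (assumed), the value could be
// passed along like so:
// $crawl->setRepeat(hoursToRepeatDays(12));
```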
setOnlyProcessIfNew¶
Swader\Diffbot\Api\Crawl::
setOnlyProcessIfNew
($int = 1)¶
Parameters:
- $int (int) – [Optional] a boolean flag represented as an integer
Returns: $this
By default repeat crawls will only process new (previously unprocessed) pages. Set to 0 to process all content on repeat crawls.
setMaxRounds¶
Swader\Diffbot\Api\Crawl::
setMaxRounds
($input = 0)¶
Parameters:
- $input (int) – [Optional] Maximum number of crawl repeats. Defaults to 0 (repeat indefinitely).
Returns: $this
Specify the maximum number of crawl repeats. By default (maxRounds=0) repeating crawls will continue indefinitely.
setObeyRobots¶
Swader\Diffbot\Api\Crawl::
setObeyRobots
($bool = true)¶
Parameters:
- $bool (bool) – [Optional] Either true or false
Returns: $this
Ignores robots.txt if set to false.
roundStart¶
Swader\Diffbot\Api\Crawl::
roundStart
($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: The result of the call if $commit is truthy, otherwise the current instance of the API class.
Force the start of a new crawl “round” (manually repeat the crawl). If onlyProcessIfNew is set to 1 (default), only newly-created pages will be processed.
pause¶
Swader\Diffbot\Api\Crawl::
pause
($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: The result of the call if $commit is truthy, otherwise the current instance of the API class.
Pause a crawljob.
unpause¶
Swader\Diffbot\Api\Crawl::
unpause
($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: The result of the call if $commit is truthy, otherwise the current instance of the API class.
Unpause a crawljob.
restart¶
Swader\Diffbot\Api\Crawl::
restart
($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: The result of the call if $commit is truthy, otherwise the current instance of the API class.
Restart a crawljob.
delete¶
Swader\Diffbot\Api\Crawl::
delete
($commit = true)¶
Parameters:
- $commit (bool) – [Optional] Either true or false
Returns: The result of the call if $commit is truthy, otherwise the current instance of the API class.
Delete a crawljob.
buildUrl¶
Swader\Diffbot\Api\Crawl::
buildUrl
()¶
Returns: string
This method is called automatically when Swader\Diffbot\Abstracts\Api::call is called. It builds the URL which is to be called by the HTTPClient in Swader\Diffbot\Diffbot::setHttpClient, and returns it. This method can be used to get the URL for the purposes of testing in third party API clients like Postman.
Usage:
$api-> // ... set up API
$myUrl = $api->buildUrl();
call¶
Swader\Diffbot\Api\Crawl::
call
()¶
Returns: Swader\Diffbot\Entity\EntityIterator
When the API instance has been fully configured, this method executes the call. If all went well, it will return a collection of Swader\Diffbot\Entity\JobCrawl objects, each with information about a job under the current Diffbot token. How many get returned depends on the action that was performed - see below.
JobCrawl Class¶
The JobCrawl class is a container of information about a crawljob. If a crawljob is being created with the Crawl API, the Crawl API will return a single instance of JobCrawl with the information about the created job. If the Crawl API is being called without settings, it returns all the token’s crawljobs, each in a separate instance. If the crawljob is being deleted, restarted, paused, etc., only the instance pertaining to the relevant crawljob is returned.
-
class
Swader\Diffbot\Entity\
JobCrawl
¶
getMaxToCrawl¶
Swader\Diffbot\Entity\JobCrawl::
getMaxToCrawl
()¶
Returns: int Maximum number of pages to crawl with this crawljob
getMaxToProcess¶
Swader\Diffbot\Entity\JobCrawl::
getMaxToProcess
()¶
Returns: int Maximum number of pages to process with this crawljob
getOnlyProcessIfNew¶
Swader\Diffbot\Entity\JobCrawl::
getOnlyProcessIfNew
()¶
Returns: bool Whether or not the job was set to only process newly found links, ignoring old but potentially updated ones
getSeeds¶
Swader\Diffbot\Entity\JobCrawl::
getSeeds
()¶
Returns: array Seeds as given to the crawljob on creation. Returned as an array, suitable for direct insertion into a new crawljob via Swader\Diffbot\Api\Crawl::setSeeds.
Search API¶
Diffbot’s Search API allows you to search the extracted content of one or all of your Diffbot “collections.” A collection is a discrete Crawlbot (Swader\Diffbot\Api\Crawl) or Bulk API job, and includes all of the web pages processed within that job.
In order to search a collection, you must first create that collection using either Crawlbot or the Bulk API. A collection can be searched before a crawl or bulk job is finished.
Whereas Crawlbot returns information about a specific crawljob, the Search API returns sets of matching documents from Diffbot’s database, depending on provided query parameters.
The API consists of two parts: the API class used to make the call and return the results, and the SearchInfo class as an alternative result, providing metadata about the query and the complete resultset. We’ll describe both, in order.
Note that the API class extends Swader\Diffbot\Abstracts\Api, so be sure to read that first if you haven’t already.
Search API Class¶
-
class
Swader\Diffbot\Api\
Search
¶
This API class is a bit specific in that it only extends Swader\Diffbot\Abstracts\Api to inherit part of a single function. Almost everything else is custom implemented, due to the highly specific nature of the API.
Basic usage:
use Swader\Diffbot\Diffbot;
$diffbot = new Diffbot('my_token');
$search = $diffbot->search('author:"Miles Johnson" AND type:article');
$result = $search->call();
foreach ($result as $article) {
echo $article->getTitle();
}
$info = $search->call(true);
echo $info->getHits(); // 50
__construct¶
Swader\Diffbot\Api\Search::
__construct
($q)¶
Parameters:
- $q (string) – Query string to run on the collection(s)
The constructor takes a string like “foo AND bar AND title:baz”. This would make the API search for documents containing both “foo” and “bar” in any of the fields, and “baz” in the title field.
setCol¶
Swader\Diffbot\Api\Search::
setCol
($col = null)¶
Parameters:
- $col (string) – [Optional] Name of collection to search
Returns: $this
If collection name is not provided, Search API will search all the collections under the currently active token.
setNum¶
Swader\Diffbot\Api\Search::
setNum
($num = 20)¶
Parameters:
- $num (string|int) – Number of results to return
Returns: $this
The $num param should either be a number, or the string “all” if you want the API to return all the results. Note that this may be quite a large payload if the search terms are broad, and you’d likely be better off paginating the result (see below).
setStart¶
Swader\Diffbot\Api\Search::
setStart
($start = 0)¶
Parameters:
- $start (int) – The starting result number. Used during pagination.
Returns: $this
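Pagination combines setNum (page size), setStart (offset) and the total match count reported by SearchInfo::getHits. The arithmetic can be sketched as follows; the helper name is hypothetical and not part of the client, and the commented loop assumes a configured $diffbot instance and a query string in $q.

```php
<?php
// Hypothetical helper: computes the setStart() offsets needed to walk
// through $hits results in pages of $perPage.
function pageStartOffsets(int $hits, int $perPage): array
{
    if ($perPage < 1) {
        throw new InvalidArgumentException('Page size must be at least 1.');
    }
    $offsets = [];
    for ($start = 0; $start < $hits; $start += $perPage) {
        $offsets[] = $start;
    }
    return $offsets;
}

print_r(pageStartOffsets(50, 20)); // three pages: offsets 0, 20 and 40

// Assumed usage with the client (not executed here):
// $hits = $diffbot->search($q)->call(true)->getHits();
// foreach (pageStartOffsets($hits, 20) as $start) {
//     $page = $diffbot->search($q)->setNum(20)->setStart($start)->call();
// }
```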
buildUrl¶
Swader\Diffbot\Api\Search::
buildUrl
()¶
Returns: string
This method is called automatically when Swader\Diffbot\Abstracts\Api::call is called. It builds the URL which is to be called by the HTTPClient in Swader\Diffbot\Diffbot::setHttpClient, and returns it. This method can be used to get the URL for the purposes of testing in third party API clients like Postman.
Usage:
$api-> // ... set up API
$myUrl = $api->buildUrl();
call¶
Swader\Diffbot\Api\Search::
call
($info = false)¶
Parameters:
- $info (bool) – Either true or false
Returns: Swader\Diffbot\Entity\SearchInfo | Swader\Diffbot\Entity\EntityIterator
When the API instance has been fully configured, this method executes the call.
If the $info parameter passed into the method is false, the return value will be an iterable collection (Swader\Diffbot\Entity\EntityIterator) of appropriate entities. Refer to each API’s documentation for details on entities returned from each API call.
If you pass in true, you force info mode and get back a Swader\Diffbot\Entity\SearchInfo object related to the last call. Keep in mind that passing in true before making a default call() will implicitly execute call() first, and then get the SearchInfo.
So:
$searchApi->call(); // gets entities
$searchApi->call(true); // gets SearchInfo about the executed query
SearchInfo Entity Class¶
When the Search API is called with info mode forced, the API will return an info object, containing various properties useful for pagination and metadata.
-
class
Swader\Diffbot\Entity\
SearchInfo
¶
getType¶
Swader\Diffbot\Entity\SearchInfo::
getType
()¶
Returns: string
Will always return “searchInfo”:
// ... API setup ... //
$result = $api->call(true);
echo $result->getType(); // "searchInfo"
getCurrentTimeUTC¶
Swader\Diffbot\Entity\SearchInfo::
getCurrentTimeUTC
()¶
Returns: int Current UTC time as timestamp
getResponseTimeMS¶
Swader\Diffbot\Entity\SearchInfo::
getResponseTimeMS
()¶
Returns: int Response time in milliseconds. Time it took to process the query on Diffbot’s end.
getNumResultsOmitted¶
Swader\Diffbot\Entity\SearchInfo::
getNumResultsOmitted
()¶
Returns: int Number of results skipped for any reason
getNumShardsSkipped¶
Swader\Diffbot\Entity\SearchInfo::
getNumShardsSkipped
()¶
Returns: int Number of skipped shards (@todo find out what those are)
getTotalShards¶
Swader\Diffbot\Entity\SearchInfo::
getTotalShards
()¶
Returns: int Total number of shards (@todo find out what those are)
getDocsInCollection¶
Swader\Diffbot\Entity\SearchInfo::
getDocsInCollection
()¶
Returns: int Total number of documents in collection. Should resemble the total number you got on the crawl job. (@todo: find out why not identical)
getHits¶
Swader\Diffbot\Entity\SearchInfo::
getHits
()¶
Returns: int Number of results that match - NOT the number of returned results! Use this for pagination as a total result count.
getQueryInfo¶
Swader\Diffbot\Entity\SearchInfo::
getQueryInfo
()¶
Returns: array Returns an associative array containing the following keys and example values:
/**
"fullQuery" => "type:json AND (author:\"Miles Johnson\" AND type:article)",
"queryLanguageAbbr" => "xx",
"queryLanguage" => "Unknown",
"terms" => [
    [
        "termNum" => 0,
        "termStr" => "Miles Johnson",
        "termFreq" => 2621376,
        "termHash48" => 224575481707228,
        "termHash64" => 4150001371756911641,
        "prefixHash64" => 3732660069076179349
    ],
    [
        "termNum" => 1,
        "termStr" => "type:json",
        "termFreq" => 2621664,
        "termHash48" => 272064464231140,
        "termHash64" => 9877301297136722857,
        "prefixHash64" => 7586288672657224048
    ],
    [
        "termNum" => 2,
        "termStr" => "type:article",
        "termFreq" => 524448,
        "termHash48" => 210861560163398,
        "termHash64" => 12449358332005671483,
        "prefixHash64" => 7586288672657224048
    ]
]
**/
@todo: find out what hashes are, and to what the freq is relative
EntityFactory¶
The EntityFactory builds the Swader\Diffbot\Entity\EntityIterator by providing it with a collection of entities returned by an API, and a Guzzle Response to consume. It implements the Swader\Diffbot\Interfaces\EntityFactory interface.
The only reason to build your own version of the EntityFactory is to provide it with instructions on how to pair API return types and entities you developed by extending Swader\Diffbot\Abstracts\Entity.
For a concrete example of this, see this tutorial on SitePoint, which demonstrates custom “AuthorFolio” and “SitePointArticle” entities automatically created by calls to a custom API.
-
class
Swader\Diffbot\Factory\
Entity
¶
createAppropriateIterator¶
Swader\Diffbot\Factory\Entity::
createAppropriateIterator
($response)¶
Parameters:
- $response (GuzzleHttp\Message\ResponseInterface) – The Guzzle response given by the Guzzle client after an API call’s execution
Returns: Swader\Diffbot\Entity\EntityIterator
The only publicly accessible method, and the only method one needs to implement when building one’s own EntityFactory, createAppropriateIterator does what it says: it takes the Guzzle response provided to it, and builds an EntityIterator, a collection of Entities fitting for an API’s result.
EntityIterator¶
The EntityIterator is a collection object containing the appropriate entities (Swader\Diffbot\Abstracts\Entity) of each API.
For example, executing a Product API call on a URL with a product will actually return an EntityIterator instance with a single element: a Swader\Diffbot\Entity\Product instance. However, the EntityIterator also serves as a proxy to its first element, so accessing a property or a getter on the EntityIterator directly will in fact access it on the first element. This allows for less verbose constructs. Compare:
$result = $api->call();
echo $result->getAuthor();
And:
$result = $api->call();
foreach ($result as $entity) {
echo $entity->getAuthor();
}
Assuming we called the Product API, the above snippets are identical logically, because the Product API only returns a single Product entity.
As evident above, the EntityIterator also acts as an array, and thus can be fully iterated through for when APIs return sets of entities rather than just one (see Swader\Diffbot\Api\Image).
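The proxy-to-first-element behaviour can be illustrated with a small stand-alone class. This is a sketch of the general technique (magic __call and __get forwarding combined with IteratorAggregate), written under the assumption that it resembles what the description implies; the class name is hypothetical and this is not the client's actual EntityIterator code.

```php
<?php
// Hypothetical stand-in for a collection that forwards unknown method calls
// and property reads to its first element, while remaining iterable.
class FirstElementProxy implements IteratorAggregate, Countable
{
    private $items;

    public function __construct(array $items)
    {
        $this->items = array_values($items);
    }

    public function getIterator(): Iterator
    {
        return new ArrayIterator($this->items);
    }

    public function count(): int
    {
        return count($this->items);
    }

    // Forward any unknown method call to the first element.
    public function __call($name, $arguments)
    {
        if (!$this->items) {
            throw new BadMethodCallException("No elements to call $name() on.");
        }
        return call_user_func_array([$this->items[0], $name], $arguments);
    }

    // Forward any unknown property read to the first element.
    public function __get($name)
    {
        if (!$this->items) {
            throw new OutOfBoundsException("No elements to read $name from.");
        }
        return $this->items[0]->$name;
    }
}
```

With a wrapper like this, calling getAuthor() on the collection and calling it on the first element inside a foreach loop give the same value, which is the equivalence the two snippets above rely on.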
-
class
Swader\Diffbot\Entity\
EntityIterator
¶
__construct¶
Swader\Diffbot\Entity\EntityIterator::
__construct
(array $objects, $response)¶
Parameters:
- $objects (array) – An array of entities returned by the API
- $response (GuzzleHttp\Message\ResponseInterface) – The original Response object returned by the API, useful for getting raw data if you need to additionally process results
The EntityIterator is automatically constructed by Swader\Diffbot\Factory\Entity - you’ll almost never need to instantiate it yourself. It needs an array of objects, which is an array of Entities (Swader\Diffbot\Abstracts\Entity) through which one can then iterate when processing results, and a Guzzle Response object which one can use to process the raw return data. See below.
Exceptions¶
This document contains the descriptions and throw cases for all exceptions in the client. Use this reference when you’re unsure why you may have gotten an exception.
DiffbotException¶
-
exception
Swader\Diffbot\Exceptions\
DiffbotException
¶
The DiffbotException is an empty exception class that extends the base PHP \Exception. It is the base for all other Diffbot exceptions - though currently, it is the only one.
Its main current purpose is to let the user know that something went wrong with the client, not its dependencies or the application consuming it.
Interfaces¶
This document contains the descriptions for all interfaces used in the client.
Api¶
-
interface
Swader\Diffbot\Interfaces\
Api
¶ The API interface is there as a contract for developing custom APIs, not unlike the Swader\Diffbot\Api\Custom class.
setTimeout¶
Swader\Diffbot\Interfaces\Api::
setTimeout
($timeout = 30000)¶
Parameters:
- $timeout (int) – The timeout value in milliseconds. Defaults to 30000 (30 seconds)
Returns: $this
All Diffbot API endpoints support a timeout parameter which tells them after how many milliseconds to stop expecting a response from the page being processed.
call¶
Swader\Diffbot\Interfaces\Api::
call
()¶
Returns: Swader\Diffbot\Entity\EntityIterator
The call method should execute the remote call to the API. It must return an instance of Swader\Diffbot\Entity\EntityIterator containing the set of appropriate entities for the return value of said API. In custom APIs, these are usually Swader\Diffbot\Entity\Wildcard entities, unless otherwise specified via a custom implementation of Swader\Diffbot\Interfaces\EntityFactory.
buildUrl¶
Swader\Diffbot\Interfaces\Api::
buildUrl
()¶
Returns: string
This method is called automatically when Swader\Diffbot\Interfaces\Api::call is called. It builds the URL which is to be called by the HTTPClient in Swader\Diffbot\Diffbot::setHttpClient, and returns it. This method can be used to get the URL for the purposes of testing in third party API clients like Postman.
EntityFactory¶
-
interface
Swader\Diffbot\Interfaces\
EntityFactory
¶ The EntityFactory interface is there as a contract for developing custom Entity Factories. For example, you may want to make sure that a call to an API returns specific entities rather than Swader\Diffbot\Entity\Wildcard, or some of the predefined ones like Swader\Diffbot\Entity\Product. A specific example would be having a custom API which processes a site with board game cards. Each card has a specific value at a specific location, and these values may correspond. Rather than manually process data in Swader\Diffbot\Entity\Wildcard entities after a call to this custom API, you might want to define a GameCard entity and give it fields and methods specific to the context. A custom entity factory is then used to bind the newly defined entity with the custom API.
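The binding between a returned type and a custom entity class can be pictured as a lookup table with a fallback. The sketch below is self-contained and uses stub classes and a "type" key that are assumptions for illustration only; a real factory would implement Swader\Diffbot\Interfaces\EntityFactory and work from the Guzzle response instead of a plain array.

```php
<?php
// Stub entities standing in for a Wildcard-like fallback and a custom GameCard.
class WildcardStub
{
    public $data;

    public function __construct(array $data)
    {
        $this->data = $data;
    }
}

class GameCardStub extends WildcardStub
{
    // A context-specific getter; the "attack" field is made up for this example.
    public function getAttack()
    {
        return $this->data['attack'] ?? null;
    }
}

// Hypothetical helper: picks an entity class per object based on its type,
// falling back to a generic class when the type is unknown.
function buildEntities(array $objects, array $typeMap, string $fallbackClass): array
{
    $entities = [];
    foreach ($objects as $object) {
        $type = $object['type'] ?? '';
        $class = $typeMap[$type] ?? $fallbackClass;
        $entities[] = new $class($object);
    }
    return $entities;
}

$entities = buildEntities(
    [['type' => 'gameCard', 'attack' => 5], ['type' => 'unknown']],
    ['gameCard' => 'GameCardStub'],
    'WildcardStub'
);
echo get_class($entities[0]), ' / ', get_class($entities[1]), "\n";
```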
createAppropriateIterator¶
Swader\Diffbot\Interfaces\EntityFactory::
createAppropriateIterator
($response)¶
Parameters:
- $response (GuzzleHttp\Message\ResponseInterface) – The response received from the API call. Must be of the GuzzleHttp v5 type. Automatic if the Guzzle client is used, but version 5 only.
Returns: The entity iterator containing the appropriate entities, built from the contents of $response.
Traits¶
All the traits used in the Diffbot PHP client are described in this one document.
DiffbotAware¶
-
trait
Swader\Diffbot\Traits\
DiffbotAware
¶ The DiffbotAware trait is there to make the API classes spawned by Diffbot aware of their parent, so that common configuration values and other factories can be accessed even after an API class has been instantiated.
Unless you’re implementing your own API class which doesn’t extend \Swader\Diffbot\Abstracts\Api, you won’t need this.
registerDiffbot¶
-
Swader\Diffbot\Traits\DiffbotAware::
registerDiffbot
($d)¶
Parameters:
- $d (\Swader\Diffbot\Diffbot) – An instance of the Diffbot main class to inject into children, like instances of various API classes.
Returns: $this
StandardApi¶
-
trait
Swader\Diffbot\Traits\
StandardApi
¶ The StandardApi trait contains some methods common to most, if not all, API classes. These methods are setters for fields which appear in every Diffbot API: links, breadcrumb, meta and querystring. More information available under optional fields in various API doc files.
setLinks¶
setMeta¶
setBreadcrumb¶
-
Swader\Diffbot\Traits\StandardApi::
setBreadcrumb
($bool)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
Sets the breadcrumb optional field to true. The API then returns a top-level array (breadcrumb) of URLs and link text from page breadcrumbs.
setQuerystring¶
-
Swader\Diffbot\Traits\StandardApi::
setQuerystring
($bool)¶
Parameters:
- $bool (bool) – Either true or false
Returns: $this
Sets the querystring optional field to true. The API then returns any key/value pairs present in the URL querystring. Items without a discrete value will be returned as true.
StandardEntity¶
-
trait
Swader\Diffbot\Traits\
StandardEntity
¶ The StandardEntity trait is here to add some common methods to the various entities. These make sense only in the standard entities, i.e. the data formats returned by Diffbot, which is why they aren’t in the abstract \Swader\Diffbot\Abstracts\Entity class. You probably won’t need this trait unless you define a \Swader\Diffbot\Api\Custom API which offers fields of the same names as those returned by the getters in this trait.
getLang¶
Swader\Diffbot\Traits\StandardEntity::
getLang
()¶
Returns: string Returns the language code of the detected language of the processed content. The code returned is a two-character ISO 639-1 code: http://en.wikipedia.org/wiki/List_of_ISO_639-1_codes
getHumanLanguage¶
Swader\Diffbot\Traits\StandardEntity::
getHumanLanguage
()¶
Returns: string Alias method for getLang() above.
getPageUrl¶
Swader\Diffbot\Traits\StandardEntity::
getPageUrl
()¶
Returns: string Returns the URL which was processed
getResolvedPageUrl¶
Swader\Diffbot\Traits\StandardEntity::
getResolvedPageUrl
()¶
Returns: string Returns the page URL which was resolved by redirects, if any. Will often be identical to the result from getPageUrl above.
getTitle¶
Swader\Diffbot\Traits\StandardEntity::
getTitle
()¶
Returns: string Returns the title of the document which was processed.
getLinks¶
Swader\Diffbot\Traits\StandardEntity::
getLinks
()¶
Returns: array | null Returns an array of all links found on the processed page. Links will be simple string elements in an indexed array. If the Swader\Diffbot\Traits\StandardApi::setLinks method was not called, will return null.
getMeta¶
Swader\Diffbot\Traits\StandardEntity::
getMeta
()¶
Returns: array | null Returns an array containing the full contents of page meta tags, including sub-arrays for OpenGraph tags, Twitter Card metadata, schema.org microdata, and – if available – oEmbed metadata. If the Swader\Diffbot\Traits\StandardApi::setMeta method was not called, will return null.
getBreadcrumb¶
Swader\Diffbot\Traits\StandardEntity::
getBreadcrumb
()¶
Returns: array | null Returns a top-level array (breadcrumb) of URLs and link text from page breadcrumbs. If the Swader\Diffbot\Traits\StandardApi::setBreadcrumb method was not called, will return null.
getQueryString¶
Swader\Diffbot\Traits\StandardEntity::
getQueryString
()¶
Returns: array | null Returns any key/value pairs present in the URL querystring. Items without a discrete value will be returned as true. If the Swader\Diffbot\Traits\StandardApi::setQuerystring method was not called, will return null.
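As a rough picture of the shape described for the querystring field, the following sketch derives key/value pairs from a URL locally, returning true for items without a discrete value. The function name is hypothetical, and the exact value types in a real API response (for example whether numbers arrive as strings) are an assumption.

```php
<?php
// Hypothetical helper: mimics the described shape of the querystring field.
function queryStringPairs(string $url): array
{
    $query = parse_url($url, PHP_URL_QUERY);
    if (!is_string($query) || $query === '') {
        return [];
    }
    $pairs = [];
    foreach (explode('&', $query) as $item) {
        if ($item === '') {
            continue;
        }
        $parts = explode('=', $item, 2);
        $key = urldecode($parts[0]);
        // Items without "=value" are reported as true.
        $pairs[$key] = count($parts) === 2 ? urldecode($parts[1]) : true;
    }
    return $pairs;
}

var_dump(queryStringPairs('http://example.com/list?page=2&debug'));
```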