Basic web scraping using Goutte and Symfony DomCrawler
Here, I am going to explain how to perform basic web scraping using Goutte and the Symfony DomCrawler, and how to get machine-readable information from web pages by way of web scraping. Currently, most API documentation is not written by hand; it is generated by tools meant for that purpose. There are several popular and reliable tools available for API documentation generation, such as phpDocumentor or Sami.
Now, interestingly, we will reverse this process of creating documentation from code, and thereby generate code from
documents!
Required Installation
Before using the DomCrawler, you obviously need to install Goutte: https://github.com/FriendsOfPHP/Goutte
Only after a successful installation can we use the Symfony DomCrawler here, since Goutte is built on top of the DomCrawler component.
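Assuming a Composer-based Symfony project, the installation is a single command (at the time of writing, the package is published as fabpot/goutte):

composer require fabpot/goutte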
Now, let's build a simple crawler to find the available links on a web page.
Add the lines below above the class declaration in the file src/AppBundle/Controller/DefaultController.php:
use Goutte\Client;
use Symfony\Component\DomCrawler\Crawler;
Add the method below after all the existing methods in the file src/AppBundle/Controller/DefaultController.php:
/**
 * @Route("/links", name="crawler")
 */
public function crawlerAction()
{
    $url = "http://www.agiratech.com";

    // The Goutte client fetches the page and returns a DomCrawler instance
    $client = new Client();
    $crawler = $client->request('GET', $url);

    // Count the <a> tags on the page
    $links_count = $crawler->filter('a')->count();
    $all_links = [];

    if ($links_count > 0) {
        $links = $crawler->filter('a')->links();
        foreach ($links as $link) {
            // getUri() resolves each href into an absolute URI
            $all_links[] = $link->getUri();
        }
        $all_links = array_unique($all_links);
        echo "All Available Links From the $url Page<pre>";
        print_r($all_links);
        echo "</pre>";
    } else {
        echo "No Links Found";
    }
    die; // stop here instead of returning a Response object
}
Here, I have created a new route, http://localhost/links, for my application (http://localhost is my local domain name), and created an object of the Client class named $client. Using this object, I call the request() method to fetch the page and get back a Crawler, as in this line from the controller above:
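$crawler = $client->request('GET', $url);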
From the expression $crawler->filter('a')->count() we get the count of HTML <a> tags on the particular page (http://www.agiratech.com).
Similarly, from $crawler->filter('a')->links() we get all the links on that page as Link objects.
And from $link->getUri() we get the absolute URI of each of those links.
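As an aside, if you only need raw attribute values rather than Link objects, the DomCrawler also provides an extract() method. A minimal sketch (note that extract() returns the href values exactly as written in the HTML, while getUri() resolves them into absolute URIs):

// Collect the raw href attribute of every <a> tag in one pass
$hrefs = $crawler->filter('a')->extract(['href']);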
Conclusion
The above example shows how to extract all the links from an HTML document and save them in an array, $all_links. Likewise, we can extract various other data from a web page.
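For instance, here is a minimal sketch of pulling other data from the same Crawler object; the h1 selector is just an illustrative assumption, and each() invokes the closure once per matched node:

// Grab the text of every <h1> heading on the page
$headings = $crawler->filter('h1')->each(function ($node) {
    return $node->text();
});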
In fact, many more powerful things can be done with the extracted data. For instance, building on the above example, we could even travel into all the pages behind those links and gather as much further information as required. I will cover more such extraction techniques with different examples in future blogs. Try it out for yourself!
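As a rough sketch of that idea: the Goutte client can follow a Link object directly with click(), which returns a new Crawler for the target page. The depth-one loop below is my own illustration:

// Visit each extracted link and count the <a> tags on that page too
foreach ($crawler->filter('a')->links() as $link) {
    $sub_crawler = $client->click($link);
    echo $link->getUri() . " has " . $sub_crawler->filter('a')->count() . " links\n";
}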