Use cURL Functions to Crawl Links from Websites

PHP is incredibly powerful and lets you do some really cool things, like crawling all the links on a website of your choice. Once you have this data you can do whatever you’d like with it, such as saving it to a database or manipulating it to suit your needs.

Building a Simple PHP Crawler

The code below will output all the hyperlinks on the target URL. To use it, simply create a new PHP document and save it to your server.
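As a rough sketch of what such a script could look like (not necessarily the original listing), the example below downloads a page with cURL and pulls out the links with `preg_match_all()`. The function names `fetch_html` and `extract_links`, the `url` query parameter, and the regular expression are all assumptions made for illustration.

```php
<?php
// Sketch only: function names, the "url" parameter and the regex are assumptions.

/** Download a page with cURL; returns the body, or null on failure. */
function fetch_html(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    $body = curl_exec($ch);

    return is_string($body) ? $body : null;
}

/**
 * Find every <a href="..."> in the HTML.
 * Returns a list of ['href' => ..., 'text' => ...] arrays.
 */
function extract_links(string $html): array
{
    // "~" is the delimiter, so a "/" inside the pattern cannot end the regex early.
    $pattern = '~<a\s[^>]*href=(["\'])(.*?)\1[^>]*>(.*?)</a>~is';
    preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

    $links = [];
    foreach ($matches as $m) {
        // Link text with inner tags removed; empty when the link wraps only an image.
        $links[] = ['href' => $m[2], 'text' => trim(strip_tags($m[3]))];
    }
    return $links;
}

// Runs only when a valid URL is passed in, e.g. crawler.php?url=http://example.com
$url = $_GET['url'] ?? '';
if (filter_var($url, FILTER_VALIDATE_URL)) {
    foreach (extract_links(fetch_html($url) ?? '') as $link) {
        echo htmlspecialchars($link['text'] . ' -> ' . $link['href']), "<br>\n";
    }
}
```

Using `~` instead of `/` as the regex delimiter is a deliberate choice here: HTML is full of slashes, and an unescaped `/` inside a `/.../` pattern makes PHP treat the rest of the pattern as modifiers.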

You can see from this code that creating a really powerful spider with PHP cURL functions isn’t that hard to do.

Let’s Discuss More about this Link Spider

Some would hesitate to call this an actual spider since it’s only crawling one specified page, but I beg to differ. These are the makings of a basic search spider; it just needs some additional automation and AI. You can easily use this code as a starting point for a more complex PHP crawler.
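As one possible illustration of that "additional automation", the sketch below turns a single-page link extractor into a small multi-page crawler by keeping a queue of pages still to visit. Everything here is an assumption: the function name `crawl_site`, the page limit, and the fetcher, which is passed in as a callable so that any download function can be plugged in. To keep it short, only absolute http(s) links on the starting host are followed.

```php
<?php
/**
 * Breadth-first crawl starting at $startUrl (hypothetical helper).
 * $fetch is any callable that takes a URL and returns HTML, or null on failure.
 * Returns the URLs that were successfully visited, in visit order.
 */
function crawl_site(string $startUrl, callable $fetch, int $maxPages = 20): array
{
    $host    = parse_url($startUrl, PHP_URL_HOST);
    $queue   = [$startUrl];
    $seen    = [$startUrl => true]; // URLs already queued, to avoid loops
    $visited = [];

    while ($queue && count($visited) < $maxPages) {
        $url  = array_shift($queue);
        $html = $fetch($url);
        if (!is_string($html)) {
            continue; // download failed, move on to the next URL
        }
        $visited[] = $url;

        // Collect absolute http(s) links; relative links are ignored in this sketch.
        preg_match_all('~<a\s[^>]*href=(["\'])(https?://.*?)\1~i', $html, $m);
        foreach ($m[2] as $link) {
            $link = explode('#', $link, 2)[0]; // drop the #fragment part
            if (parse_url($link, PHP_URL_HOST) === $host && !isset($seen[$link])) {
                $seen[$link] = true;
                $queue[]     = $link;
            }
        }
    }
    return $visited;
}
```

Calling it would look something like `crawl_site('http://example.com/', $myFetcher)`, where `$myFetcher` stands for whatever cURL-based download function you already have. The `$maxPages` limit is there so that a mistake in the link filter cannot turn into an endless crawl.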

The cURL PHP Function Library

cURL is PHP’s “client URL” function library: the set of functions that lets you query remote servers. It’s your first step toward creating a PHP-based search engine, robot or link/keyword checker. The library allows you to connect to and communicate with many types of servers over a variety of protocols.
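To make the link checker idea a little more concrete, here is a small sketch that asks a server for headers only and reports the HTTP status code. `link_status` and `is_link_ok` are hypothetical helper names, and treating only 2xx codes as "working" is an assumption; the cURL options and `curl_getinfo()` with `CURLINFO_HTTP_CODE` come from PHP's cURL extension.

```php
<?php
/** Send a headers-only request and return the HTTP status code (0 if the request failed). */
function link_status(string $url): int
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_NOBODY         => true, // headers only, skip the body
        CURLOPT_RETURNTRANSFER => true, // do not echo anything
        CURLOPT_FOLLOWLOCATION => true, // report the status of the final URL after redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds
    ]);
    curl_exec($ch);

    return (int) curl_getinfo($ch, CURLINFO_HTTP_CODE);
}

/** Treat any 2xx status as a working link (an assumption of this sketch). */
function is_link_ok(int $status): bool
{
    return $status >= 200 && $status < 300;
}
```

A call such as `is_link_ok(link_status('http://example.com/'))` would then tell you whether the link responds. Some servers answer headers-only requests differently from normal ones, so a failed check may be worth retrying with a full request.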

cURL and Regular Expressions

Using a loop and regular expressions allows you to fine-tune the spider to pull specific on-page elements like images and videos, or links as in this example. It’s actually possible to develop your spider to learn from its mistakes using regular expressions.
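A minimal sketch of the loop-plus-regex idea: one generic pattern is built per tag/attribute pair, and a loop runs it for links, images and videos. The helper names `extract_attribute` and `extract_elements` and the choice of tags are assumptions for illustration, and the pattern is deliberately simple (it only handles quoted attribute values).

```php
<?php
/** Return the value of $attr for every <$tag ...> found in $html (hypothetical helper). */
function extract_attribute(string $html, string $tag, string $attr): array
{
    $pattern = '~<' . preg_quote($tag, '~') . '\s[^>]*'
             . preg_quote($attr, '~') . '=(["\'])(.*?)\1~i';
    preg_match_all($pattern, $html, $matches);

    return $matches[2]; // the captured attribute values
}

/** Loop over the element types we care about and collect each one. */
function extract_elements(string $html): array
{
    $targets = [
        'links'  => ['a', 'href'],
        'images' => ['img', 'src'],
        'videos' => ['video', 'src'],
    ];

    $found = [];
    foreach ($targets as $name => [$tag, $attr]) {
        $found[$name] = extract_attribute($html, $tag, $attr);
    }
    return $found;
}
```

Adding another element type is then a one-line change to `$targets`. For messy real-world markup, PHP's `DOMDocument` class is usually a more robust choice than regular expressions, at the cost of a little more code.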

Some Sites Don’t Work?

The sites you are having trouble crawling are most likely blocking automated requests, so they return an error or an empty response instead of the page. These are typically larger sites like Facebook and Google.
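One way to find out why a site returns nothing is to look at the HTTP status code and the cURL error message rather than only the body. The sketch below does that; `fetch_with_diagnostics` and `explain_result` are hypothetical names, the user agent string is made up, and the list of status codes treated as "refused" is an assumption.

```php
<?php
/** Fetch a URL and report what happened, not just the body. */
function fetch_with_diagnostics(string $url): array
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
        // Made-up identifier; some servers may reject requests that send no User-Agent.
        CURLOPT_USERAGENT      => 'SimpleLinkCrawler/0.1',
    ]);
    $body = curl_exec($ch);

    return [
        'status' => (int) curl_getinfo($ch, CURLINFO_HTTP_CODE),
        'error'  => curl_error($ch), // empty string when cURL itself succeeded
        'body'   => is_string($body) ? $body : '',
    ];
}

/** Turn a result from fetch_with_diagnostics() into a short human-readable reason. */
function explain_result(array $result): string
{
    if ($result['error'] !== '') {
        return 'cURL error: ' . $result['error'];
    }
    if ($result['status'] === 403 || $result['status'] === 429) {
        return 'The server refused the request (HTTP ' . $result['status'] . ')';
    }
    if ($result['status'] >= 400) {
        return 'HTTP error ' . $result['status'];
    }
    if (trim($result['body']) === '') {
        return 'Empty response body';
    }
    return 'OK';
}
```

Printing `explain_result(fetch_with_diagnostics($url))` for a problem site should at least show whether the request failed at the network level, was refused by the server, or succeeded with an empty page.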

Why are there blank lines?

Those are the links returned without any text. Perhaps they are wrapping an image or used for some other purpose.
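If the blank lines bother you, one option is to fall back to something else when a link has no text. The sketch below assumes you have the HTML found between `<a>` and `</a>` along with the link's `href`; `link_label` is a hypothetical helper, and the bracketed placeholder formats are just one possible choice.

```php
<?php
/** Pick a readable label for a link from the HTML between <a> and </a>. */
function link_label(string $innerHtml, string $href): string
{
    // Normal case: the link has visible text once inner tags are stripped.
    $text = trim(strip_tags($innerHtml));
    if ($text !== '') {
        return $text;
    }

    // No visible text: use the alt text of a wrapped image if there is one.
    if (preg_match('~<img\s[^>]*alt=(["\'])(.*?)\1~i', $innerHtml, $m) && trim($m[2]) !== '') {
        return '[image: ' . trim($m[2]) . ']';
    }

    // Nothing usable: show the target so the line is not blank.
    return '[no text] ' . $href;
}
```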

Expanding on this Functionality

The possibilities for this code are endless and depend only on your ingenuity. Know what your goal is and strive to develop an app that meets that objective efficiently. Use the community as your resource and never stop pushing the limits of what you already know.

Have a question? Confused? Leave a comment below and don’t forget to “Like” WordImpress on Facebook! Hope you enjoyed this article.

Devin Walker is a San Diego-based WordPress developer and enthusiast. He is the author of several popular and highly rated WordPress themes and plugins. In his free time he enjoys playing golf and traveling.

  • Andy

    Got to love ROBOT

  • http://shebasoft.com/ ShebaSoft

    awsome :D

    thank you

  • Matt Wandel

    If you had a list of URLs where images were and a list of corresponding new names that you wanted those images to be named, how would you loop through your list and retrieve and rename the images?

    For example, I have:
    Where I have a whole bunch of $remote_img / $image_name combinations in a spreadsheet. I want to loop through my spreadsheet list and harvest and rename each photo I have in my spreadsheet. The code above works, but I want to make $remote_img and $img_name dynamic and loop through them. Is this something easy for you to code and give me a nudge? Total newbie. I appreciate your site. Thanks, -Matt

    • Mattwandel

       Your site removed the p h p code I pasted in.  Must be a security thing.  I hope you can see what I sent.  I’m not typing all that again.  Cheers.  Matt.

  • Igor Magrini

    Hi i tried this code on my website, but it returns two warnings, can you help me?

    Warning: preg_match_all() [function.preg-match-all]: Unknown modifier ‘a’ in /home/igormagrini/www/crawl/example3.php on line 12

    Warning: Invalid argument supplied for foreach() in /home/igormagrini/www/crawl/example3.php on line 14