PHP and Wget sitemap generator for search engines
Sitemaps are of primary importance in Search Engine Optimization (SEO for friends :) ).
The PHP Wget Sitemap Generator class uses wget to build a local mirror of the target site, then generates the URL list for the sitemap by walking the mirrored directory tree.
Wget is highly configurable, so read its man page to find the best options for your site, then change the command string passed to exec() accordingly.
The code:
Version: 0.2
<?php
// PHP Wget Sitemap generator v0.2
// (c) 2008 by Paolo Ardoino < paolo.ardoino@gmail.com >
class PHPWgetSitemap {
public $opts = array("sitemap_file" => "sitemap.xml", "website_url" => "");
public $sitemap = array();
function __construct() {
echo "PHP Wget Sitemap generator v0.2\t(c) 2008 by Paolo Ardoino < paolo.ardoino@gmail.com >\n";
}
function setSitemapFile($sitemap_file) {
$this->opts["sitemap_file"] = $sitemap_file;
}
function setWebsiteUrl($website_url) {
$this->opts["website_url"] = $website_url;
}
function mirror() {
if($this->opts["website_url"] != "") {
echo "Wget: fetching '".$this->opts["website_url"]."' website\n";
exec("wget -m ".$this->opts["website_url"]." 2> wget.log");
}
}
function generate() {
if($this->opts["website_url"] != "") {
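// Strip the leading "http://" (7 characters): wget -m mirrors into a directory named after the host.
$website_dir = substr($this->opts["website_url"], 7);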
if($website_dir != "") {
echo "PHPWgetSitemap: scanning '".$website_dir."' for sitemap generation\n";
$this->sitemap = $this->_scan($website_dir);
$this->ssave();
}
}
}
function _scan($dir) {
$sitemap = array();
$FILES_EXCLUDE = array(".", "..", "index.php", "index.html", "index.htm");
if($dir != "") {
if (is_dir($dir)) {
if ($handle = opendir($dir)) {
chdir($dir);
$sitemap[] = $dir."/";
while (false !== ($file = readdir($handle))) {
if (!in_array($file, $FILES_EXCLUDE)) {
if(is_dir($file)) {
$arr = $this->_scan($file);
foreach ($arr as $value) {
$sitemap[] = $dir."/".$value;
}
} else {
$sitemap[] = $dir."/".$file;
}
}
}
chdir("../");
// Close the handle only when opendir() succeeded.
closedir($handle);
}
}
}
return $sitemap;
}
function ssave() {
$sitemap_file = $this->opts["sitemap_file"];
if($sitemap_file != "") {
if($fp = fopen($sitemap_file, "w+")) {
// XML header and opening <urlset> tag (single-quoted, so the inner double quotes need no escaping).
$out = '<?xml version="1.0" encoding="UTF-8"?>
<urlset
xmlns="http://www.google.com/schemas/sitemap/0.84"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.google.com/schemas/sitemap/0.84
http://www.google.com/schemas/sitemap/0.84/sitemap.xsd">'."\n";
for($i = 0, $y = sizeof($this->sitemap); $i < $y; $i++) {
// Escape &, < and > so each entry stays valid XML.
$out .= "<url>\n\t<loc>http://".htmlspecialchars($this->sitemap[$i])."</loc>\n\t<priority>0.500</priority>\n</url>\n";
}
$out .= '</urlset>';
fputs($fp, $out);
fclose($fp);
echo "Sitemap has been written to '".$sitemap_file."'.\n";
} else {
echo "Error: cannot save '".$sitemap_file."' file.\n";
}
}
}
}
$n = new PHPWgetSitemap();
$n->setWebsiteUrl("http://ardoino.com");
$n->mirror();
$n->generate();
?>
Download this code: phpwgetsitemap.txt
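The directory walk done by _scan() can also be sketched with PHP's SPL iterators, which avoids the chdir() calls. This is only an illustration, not the author's method: listMirrorUrls() and its parameters are hypothetical names, and it assumes the mirror directory is named after the host, as wget -m creates it.

```php
<?php
// Hypothetical helper, not part of the original class: lists everything under
// a wget mirror directory and maps each entry to a URL on the mirrored host.
// Assumption: the last component of $mirrorDir is the host name.
function listMirrorUrls(string $mirrorDir, string $scheme = "http"): array
{
    $skip = array("index.php", "index.html", "index.htm");
    $root = rtrim($mirrorDir, "/");
    $host = basename($root);
    // The mirror root itself stands for the site's home page.
    $urls = array($scheme . "://" . $host . "/");

    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS),
        RecursiveIteratorIterator::SELF_FIRST
    );
    foreach ($iterator as $entry) {
        if ($entry->isFile() && in_array($entry->getFilename(), $skip)) {
            continue; // index pages are already covered by their directory URL
        }
        // Path relative to the mirror root, with forward slashes.
        $relative = substr($entry->getPathname(), strlen($root) + 1);
        $relative = str_replace(DIRECTORY_SEPARATOR, "/", $relative);
        $urls[] = $scheme . "://" . $host . "/" . $relative . ($entry->isDir() ? "/" : "");
    }
    sort($urls);
    return $urls;
}
```

Called as listMirrorUrls("ardoino.com") after mirror() has run, it returns full URLs, so the "http://" prefix added in ssave() would no longer be needed.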


Hi. Thanks for the scripts
I’m working with your PHP Wget Sitemap generator v0.2 script and was wondering if it’s possible to exclude GIFs?
Hello,
from the wget man page you can see two options: --reject and --accept
You need this: --reject .gif
So edit this line:
exec("wget -m ".$this->opts["website_url"]." 2> wget.log");
as follows:
exec("wget -R .gif -m ".$this->opts["website_url"]." 2> wget.log");
This should work
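If several file types need to be rejected, the command string could be assembled by a small helper instead of being edited by hand. The sketch below is an assumption, not part of the original class: buildWgetCommand() and its parameters are hypothetical names. It relies only on wget's -m and -R options mentioned above, and on PHP's escapeshellarg() so that characters such as & in the URL are not interpreted by the shell.

```php
<?php
// Hypothetical helper: builds the command line that mirror() passes to exec().
function buildWgetCommand(string $websiteUrl, array $rejectSuffixes = array(), string $logFile = "wget.log"): string
{
    $cmd = "wget -m";
    if (count($rejectSuffixes) > 0) {
        // -R (--reject) takes a comma-separated list of suffixes or patterns.
        $cmd .= " -R " . escapeshellarg(implode(",", $rejectSuffixes));
    }
    // Quote the URL and the log file name before handing them to the shell.
    return $cmd . " " . escapeshellarg($websiteUrl) . " 2> " . escapeshellarg($logFile);
}

// Usage: reject GIF and JPEG files while mirroring.
echo buildWgetCommand("http://ardoino.com", array(".gif", ".jpg")), "\n";
```

Inside mirror(), the exec() call would then become exec(buildWgetCommand($this->opts["website_url"], array(".gif")));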
Paolo:
My web server is stuck at PHP 4. The 'public' keyword is not available in PHP 4. Is there an alternative I can use to make your class work in PHP 4?
Paolo:
Ignore the message above. I was able to use 'var' in place of 'public', and the class seems to work OK with that change under PHP 4.
OK, new question for Paolo:
Is it possible to disable the 'mirror' functionality? I have a VERY large site, and apparently the mirror function completely exhausted my 1 GB of disk space.
Paolo:
My sitemap contains more than 50,000 URLs. Do you know how I can split the file?
Germán
You should create multiple sitemaps and submit a sitemap index file.
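A minimal sketch of that approach, assuming the URL list is already available as an array of full URLs: writeSplitSitemaps(), the sitemap-N.xml file names and the $baseUrl parameter are hypothetical, and the sketch uses the sitemaps.org 0.9 namespace rather than the Google 0.84 schema used in the class above.

```php
<?php
// Hypothetical helper: writes sitemap-1.xml, sitemap-2.xml, ... into $outDir,
// plus a sitemap-index.xml that lists them.
// $baseUrl is the public location where the sitemap files will be served from.
function writeSplitSitemaps(array $urls, string $baseUrl, string $outDir, int $maxPerFile = 50000): array
{
    $ns = "http://www.sitemaps.org/schemas/sitemap/0.9";
    $header = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
    $written = array();
    // One sitemap file per chunk of at most $maxPerFile URLs.
    foreach (array_chunk($urls, $maxPerFile) as $n => $chunk) {
        $name = "sitemap-" . ($n + 1) . ".xml";
        $xml = $header . "<urlset xmlns=\"" . $ns . "\">\n";
        foreach ($chunk as $url) {
            $xml .= "<url><loc>" . htmlspecialchars($url) . "</loc></url>\n";
        }
        file_put_contents($outDir . "/" . $name, $xml . "</urlset>\n");
        $written[] = $name;
    }
    // The index refers to each sitemap file by its public URL.
    $index = $header . "<sitemapindex xmlns=\"" . $ns . "\">\n";
    foreach ($written as $name) {
        $loc = rtrim($baseUrl, "/") . "/" . $name;
        $index .= "<sitemap><loc>" . htmlspecialchars($loc) . "</loc></sitemap>\n";
    }
    file_put_contents($outDir . "/sitemap-index.xml", $index . "</sitemapindex>\n");
    return $written;
}
```

Only sitemap-index.xml then needs to be submitted to the search engine; the individual files are discovered through it.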