#1
GidanMX2



[GUIDE] Example of a command-line spider - Posted by GidanMX2 on 26/09/2007 at 09:28:45

This is more a piece of code than a guide, but it may be useful to anyone who wants to build something similar or write a script for the shell. Like every real programmer, I am not publishing any commentary, because I think the code and its existing comments are already more than enough (P.S.: if you have not done so yet, read "vero programmatore", i.e. "real programmer").
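For orientation, the script below combines two command-line idioms: reading arguments from $argv and reading commands line by line from STDIN. Here is a minimal sketch of that pattern. The function names parse_args and read_command are hypothetical and do not appear in the original script.

```php
<?php
// Hypothetical helper: returns the last non-option argument as the URL,
// or null when -h was given or no URL was passed at all.
function parse_args(array $argv): ?string
{
    $url = null;
    // $argv[0] is the script name, so start from index 1.
    foreach (array_slice($argv, 1) as $arg) {
        if ($arg === '-h') {
            return null;
        }
        $url = $arg;
    }
    return $url;
}

// Hypothetical helper: reads one command from a stream and strips the
// line ending. Returns null once the end of the input is reached.
function read_command($stream): ?string
{
    $line = fgets($stream);
    return $line === false ? null : rtrim($line, "\r\n");
}
```

In a real script these would typically be called as parse_args($argv) and, inside a while loop, read_command(STDIN); stopping the loop when null comes back avoids spinning forever on a closed input.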


#!/usr/bin/php
<?php                          //Like Those True Ones
/*******************************************************************
 *                                                                 *
 *    A PHP written, Object Oriented web spider.                   *
 *                                                                 *
 *    Author: Michele 'GidanMX2' Nolo Belina                       *
 *    Contacts: gidanmx2@gmail.com                                 *
 *              http://www.koalasoft.net/                          *
 *    License: Released under GNU General Public License V. 2      *
 *                                                                 *
 *******************************************************************/

class Spider{

    public $start_url;    // The spider's starting URL

    /*
        Class constructor
    */
    function __construct(){

        /*
            Command-line argument parsing
        */
        foreach($GLOBALS['argv'] as $key=>$arg){
            // $argv[0] is the script name, so skip it
            if($key > 0){
                switch($arg){
                    case '-h':
                        die("Usage: kernel [options] <url>\n  -h \t\tShow this message\n\n");
                    break;
                    default:
                        // Accept the argument only if the URL can actually be opened
                        if(!($fp = @fopen($arg,'r'))){
                            die("Usage: kernel [options] <url>\n  -h \t\tShow this message\n\n");
                        } else {
                            fclose($fp);
                            $this->start_url = $arg;
                        }
                    break;
                }
            }
        }
    }

    
    /*
        Puts the spider in the "Ready" state, waiting for commands
    */
    function ready(){
        echo "Spider console\n\n";
        while(true){
            echo "spider @ ".$this->start_url." > ";
            if(($input=fgets(STDIN))===false){
                break;    // End of input: leave the console instead of looping forever
            }
            $input = rtrim($input,"\r\n");
            if(preg_match('/^find\ (.*)$/i',$input,$matches)){
                switch($matches[1]){
                    case 'links':
                        $this->search_links($matches[1]);
                    break;
                    case 'mails':
                        $this->search_mails($matches[1]);
                    break;
                    default:
                        echo "Spider: find: unknown target. \"?\" for help.\n";
                    break;
                }
            }elseif(preg_match('/^cd\ (.*)$/i',$input,$matches)){
                if(preg_match('/^https?:\/\//i',$matches[1])){
                    // Absolute URL (eregi() was removed in PHP 7, so preg_match() is used)
                    $new_url = $matches[1];
                }elseif(preg_match('/^((?:\.\.(?:\/|$))+)(.*)$/',$matches[1],$up)){
                    /*
                        Handles ".."
                    */
                    $url_info = parse_url($this->start_url);
                    $path = isset($url_info['path']) ? $url_info['path'] : '';
                    // Path segments of the current location, without empty entries
                    $sub_dirs = array_values(array_filter(explode('/',$path),'strlen'));
                    // Go one level up for each leading ".."
                    $j = substr_count($up[1],'..');
                    $go_to = max(0,count($sub_dirs)-$j);
                    $new_url = $url_info['scheme'].'://'.$url_info['host'].'/'.implode('/',array_slice($sub_dirs,0,$go_to));
                    if($up[2]!=''){
                        // Append whatever follows the ".." segments
                        $new_url = rtrim($new_url,'/').'/'.$up[2];
                    }
                }else{
                    /*
                        Handles "."
                    */
                    $new_url = rtrim($this->start_url,'/').preg_replace('/^(\.?\/?)(.*)$/i','/\\2',$matches[1]);
                }
                // Move only if the new location can actually be opened
                if($fp=@fopen($new_url,'r')){
                    fclose($fp);
                    $this->start_url = $new_url;
                }else{
                    echo "Spider: cd: unable to open the requested URL.\n";
                }
            }elseif($input=="?"){
                echo "Help:\n   find mails|links\tSearch for links or mails in the page the spider is currently on.\n   cd <url>\t\tChange spider location\n\n";
            }else{
                echo "Spider: $input: command not found. \"?\" for help.\n";
            }
        }
    }

    
    /*
        Searches for links within a page
    */
    function search_links($keywords){
        echo "\n» Search for '".$keywords."' in ".$this->start_url."...\n";
        if(($txt=@file_get_contents($this->start_url))!==false){
            // Find all the links :-)
            if(preg_match_all('/href=([^\s>]+)/i',$txt,$links)){
                foreach($links[1] as $link){
                    // Strip surrounding double and single quotes
                    $link = preg_replace('/^[\'"](.*)[\'"]$/','\\1',$link);
                    echo "\tFound: $link\n";
                }
            }else{
                echo "Spider: find: No links found!\n";
            }
        } else {
            echo "Spider Error: Unable to open the requested URL.\n";
        }
    }

    
    /*
        Searches for e-mail addresses within a page
    */
    function search_mails($keywords){
        echo "\n» Search for '".$keywords."' in ".$this->start_url."...\n";
        if(($txt=@file_get_contents($this->start_url))!==false){
            // Find all the mails :-)
            // Note: "\\\\" in a single-quoted PHP string yields the regex token "\\" (a literal backslash)
            if(preg_match_all('/(?:[a-z0-9!#$%&\'*+\/=?^_`{}~-]+(?:\.[a-z0-9!#$%&\'*+\/=?^_`{}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])/i',$txt,$mails)){
                foreach($mails[0] as $mail){
                    echo "\tFound: $mail\n";
                }
            }else{
                echo "Spider: find: No mails found!\n";
            }
        } else {
            echo "Spider Error: Unable to open the requested URL.\n";
        }
    }
}

$spider = new Spider();
$spider->ready();
?>
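
The "cd .." handling in ready() is the fiddliest part of the script, so here is a minimal standalone sketch of the same idea. It assumes the current URL is always treated as a directory and ignores any query string. The names url_up and count_leading_ups are hypothetical helpers, not part of the Spider class.

```php
<?php
// Hypothetical helper: counts the leading ".." segments of a cd target,
// e.g. "../../docs" has two of them.
function count_leading_ups(string $target): int
{
    if (preg_match('/^(?:\.\.(?:\/|$))+/', $target, $m)) {
        return substr_count($m[0], '..');
    }
    return 0;
}

// Hypothetical helper: moves $levels directories up from an absolute URL.
// Assumption: the URL path is treated as a directory, never as a file.
function url_up(string $url, int $levels = 1): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return $url; // not an absolute URL: leave it untouched
    }
    // Split the path into non-empty segments: "/a/b/" becomes ["a", "b"].
    $segments = array_values(array_filter(
        explode('/', $parts['path'] ?? ''),
        function ($s) { return $s !== ''; }
    ));
    // Drop the last $levels segments, never going above the root.
    $keep = max(0, count($segments) - $levels);
    $segments = array_slice($segments, 0, $keep);

    $base = $parts['scheme'] . '://' . $parts['host'];
    if (isset($parts['port'])) {
        $base .= ':' . $parts['port'];
    }
    return $base . '/' . ($segments ? implode('/', $segments) . '/' : '');
}
```

For example, url_up('http://example.com/a/b/c/', count_leading_ups('../../')) gives 'http://example.com/a/'. Keeping the path as a list of segments, rather than picking a single array index, is what lets the function rebuild the whole parent path.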



________________________________
Edited by GidanMX2 on 26/09/07 at 17:33:34


