PHP Domain Extractor

This script may have some errors and may not catch every domain, but it has pulled every domain out of anything I've pasted thus far. Let me know of any errors that need to be addressed.

Code:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Domain Name Extractor - DomainIdiot © 2013-Present</title>
</head>

<body>
<h1>Domain Name Extractor, © 2013 - Present [DomainIdiot]</h1>
<form method="post" action="<?php echo htmlspecialchars($_SERVER['PHP_SELF']); ?>">
	Copy and paste bulk data below to extract all domains from the data:<br />
	<textarea name="rawData" style="width:800px;height:400px;"><?php echo htmlspecialchars(isset($_POST["rawData"]) ? $_POST["rawData"] : ""); ?></textarea>
	<br />
	<input type="submit" value="Parse Domains..." />
</form>
<?php
function processDomains($domList){
	$extensionList = "ac,ae,aero,af,ag,al,am,as,asia,at,au,az,ba,be,bg,bi,biz,bj,br,br.com,bt,by,bz,ca,cat,cc,cd,ch,ck,cl,cn,cn.com,co,co.nl,com,coop,cx,cy,cz,de,dk,dm,dz,edu,ee,eg,es,eu,eu.com,fi,fo,fr,gb,gb.com,gb.net,qc.com,ge,gl,gm,gov,gr,gs,hk,hm,hn,hr,hu,hu.com,ie,il,in,info,int,io,iq,ir,is,it,je,jobs,jp,ke,kg,kr,la,li,lt,lu,lv,ly,ma,mc,md,me,mil,mk,mobi,ms,mt,mu,mx,my,name,net,nf,ng,nl,no,no.com,nu,nz,org,pl,pr,pro,pt,ro,ru,sa,sa.com,sb,sc,se,se.com,se.net,sg,sh,si,sk,sm,st,su,tc,tel,tf,th,tj,tk,tl,tm,tn,to,tp,tr,travel,tw,tv,tz,ua,uk,uk.com,uk.net,gov.uk,us,us.com,uy,uy.com,uz,va,vc,ve,vg,ws,xxx,yu,za.com";
	$eL = array(); // must start as an array: "$eL[] = ..." on a string is a fatal error in PHP 7.1+
	$extensionList = explode(",",$extensionList);
	$domList = str_replace(array("\r\n","\r","\n")," ",$domList); // turn every kind of line break into a space
	$domList = str_replace("/"," ",$domList);
	$domList = str_replace("www.","",$domList);
	$domList = explode(" ",$domList);
	$full_return = array(); // collected domains; must be an array so "$full_return[] = ..." works
	
	foreach($extensionList as $a){
		$eL[] = ".".$a;
	}
	
	foreach ($domList as $a){
		$a = explode("\t",$a);
		foreach($a as $b){
			if(strstr($b,".")){
				$domain = explode(".",$b);
				$prefix = $domain[0];
				$extension = "";
				for($i=1;$i<=count($domain)-1;$i++){
					$extension .= ".".$domain[$i];
				}
				if(strlen($prefix) >= 1 && strlen($extension) >= 1){
					if(in_array(strtolower($extension),$eL)){
						$full_return[] = strtolower($prefix.$extension);
					}
				}
			}
		}
	}
	$full_return = @array_unique($full_return);
	$returnOut = "";
	
	if(@$full_return){
		foreach($full_return as $a){
			$returnOut .= $a."\n";
		}
	}
	return $returnOut;
}
echo nl2br(processDomains(@$_POST["rawData"]));
?>
</body>
</html>

Demo: http://www.freedomcatcher.com/domain_parser.php
 
•••
The views expressed on this page by users and staff are their own, not those of NamePros.
Looks good. Though I haven't tested it on my server personally, I can see some possible problems arising from the following:

Code:
	$domList = str_replace("/"," ",$domList);
	$domList = str_replace("www.","",$domList);


From what I can see, str_replace("/", " ", ...) splits the text at every slash, which isolates the host name from the rest of the URI. But raw data can also have a '?' in that position instead.
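
A minimal sketch of one way to cover that case: cut each token at the first "/", "?" or "#", whichever comes first. The name hostPart() is made up for illustration and is not part of the script above.

```php
<?php
// Hypothetical helper (not from the original script): returns the host part
// of a pasted token by cutting at the first "/", "?" or "#".
function hostPart($token){
	// Drop a leading scheme such as "http://" if there is one.
	$token = preg_replace('#^[a-z][a-z0-9+.-]*://#i', '', $token);
	// Keep everything before the first path, query or fragment delimiter.
	$parts = preg_split('/[\/?#]/', $token, 2);
	return strtolower($parts[0]);
}

echo hostPart("http://example.com/page.php?id=1"), "\n"; // example.com
echo hostPart("Example.com?ref=abc"), "\n";              // example.com
```

The result could then be checked against the extension list the same way the script already does.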

With str_replace("www.", "", ...) you'd also eliminate domains like www.com, www.us, www.de, etc. You'll need a conditional statement to treat those domains as an exception.
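
One possible shape for that conditional, as a rough sketch: strip "www." only when it is the first label and a dotted name is still left afterwards. The name stripLeadingWww() is hypothetical. This also avoids cutting "www." out of the middle of a name, which a plain str_replace would do (for example, "awww.net" would become "anet").

```php
<?php
// Hypothetical helper (not from the original script): removes "www." only
// when it is the first label and a full name (with a dot) still remains.
function stripLeadingWww($host){
	if (preg_match('/^www\.(.+\..+)$/i', $host, $m)) {
		return $m[1];
	}
	return $host;
}

echo stripLeadingWww("www.example.com"), "\n"; // example.com
echo stripLeadingWww("www.com"), "\n";         // www.com
echo stripLeadingWww("awww.net"), "\n";        // awww.net
```

A name such as www.co.uk would still be cut down to co.uk by this sketch, so a real version would probably need to check the remainder against the extension list as well.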

I was working on a domain extractor class myself but haven't completed it yet (it's part of a much larger project/app). I have a list of all TLDs and ccTLDs if you need a copy, but it too needs updating, considering the likes of .post, .xxx and all the others coming in the not too distant future.



•••
Thanks for the input. I thought about that as I was coding but seeing as this is just a free script I didn't bother fixing it.

Another thing this script has issues with is processing xx.xx domains (e.g. co.uk), which is a more important problem, but I'll get to it later when I have more time.

Thanks!
 
•••
Yeah, all too true! "Chefs don't always give away their secret recipe" :D

As for the xx.xx domains: just add co.uk to your $extensionList, as with co.nl, which is already there.

The other main dilemma, though, is the subdomain levels, which would need parsing out too. That one held me up for a while on my extractor, but I found a few ad-hoc solutions around it.
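
For what it's worth, here is a rough sketch of one such approach (not necessarily the one used in the extractor mentioned above): look for the longest known extension at the end of the host and keep only the single label in front of it. The function name registrableDomain() and the short sample list are assumptions for illustration.

```php
<?php
// Hypothetical helper (not from the original script): finds the longest known
// extension at the end of $host and returns the one label before it plus the
// extension, dropping any subdomain levels. Returns null if nothing matches.
function registrableDomain($host, array $extensions){
	$labels = explode(".", strtolower($host));
	$count = count($labels);
	// Start at index 1 so at least one label is left for the name itself;
	// lower indexes mean longer suffixes, so the first hit is the longest.
	for ($i = 1; $i < $count; $i++) {
		$suffix = implode(".", array_slice($labels, $i));
		if (in_array($suffix, $extensions, true) && $labels[$i - 1] !== "") {
			return $labels[$i - 1] . "." . $suffix;
		}
	}
	return null;
}

$extensions = array("com", "uk", "co.uk", "uk.com"); // short sample list, for illustration only
echo registrableDomain("shop.example.co.uk", $extensions), "\n"; // example.co.uk
echo registrableDomain("mail.example.com", $extensions), "\n";   // example.com
```

Because the longest suffix wins, example.uk.com stays intact instead of being reduced to uk.com.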




•••
I didn't realize that extension wasn't listed. That would explain it, lol. Thanks. If you hadn't said that, I probably would've been staring at my code for a while like "wtf, why is this not working".

Anyway, I'm not too worried about this script. If it were for a client, it would be fully featured and error free. This, on the other hand, I built to extract domains from lists on services such as namestation. It works in general; a few missed domains aren't going to hurt, especially the *www.tld names.
 
•••
Just to make it a little shorter; there is a lot of ballast.

Code:
function processDomains(&$domList){
        if ( empty($domList) ){
             return "No domains entered";
        }

        // You should actually prepare this before and store it into DB or data file
	$extensionList = explode( ",",
"ac,ae,aero,af,ag,al,am,as,asia,at,au,az,ba,be,bg,bi,biz,bj,br,br.com,bt,by,bz,ca,cat,cc,cd,ch,ck,cl,cn,cn.com,co,co.nl,com,coop,cx,cy,cz,de,dk,dm,dz,edu,ee,eg,es,eu,eu.com,fi,fo,fr,gb,gb.com,gb.net,qc.com,ge,gl,gm,gov,gr,gs,hk,hm,hn,hr,hu,hu.com,ie,il,in,info,int,io,iq,ir,is,it,je,jobs,jp,ke,kg,kr,la,li,lt,lu,lv,ly,ma,mc,md,me,mil,mk,mobi,ms,mt,mu,mx,my,name,net,nf,ng,nl,no,no.com,nu,nz,org,pl,pr,pro,pt,ro,ru,sa,sa.com,sb,sc,se,se.com,se.net,sg,sh,si,sk,sm,st,su,tc,tel,tf,th,tj,tk,tl,tm,tn,to,tp,tr,travel,tw,tv,tz,ua,uk,uk.com,uk.net,gov.uk,us,us.com,uy,uy.com,uz,va,vc,ve,vg,ws,xxx,yu,za.com"
);
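	// Escape the dots in multi-label extensions (e.g. "br.com") and accept end of input right after a domain
	$pattern = '/(\s*|\.)([-a-z0-9]+\.('.implode("|", array_map("preg_quote", $extensionList)).'))(?=\s|$)/i';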

        $matches = array();

        if ( preg_match_all( $pattern, $domList, $matches ) ){

              // Group 2 holds the bare domain; group 0 can carry a leading dot or whitespace
              return implode( "<br />", $matches[2] );

        }

        return "No domains found";
}

echo processDomains($_POST['rawData']);
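
On the comment about preparing the list beforehand: a minimal sketch of reading the extensions from a plain-text data file, one per line, and building the pattern from that. The function name buildPatternFromFile() and the file layout are assumptions, and the pattern is a simplified variant of the one above rather than a drop-in copy.

```php
<?php
// Hypothetical helper (not from the posts above): builds the matching pattern
// from a plain-text file holding one extension per line.
function buildPatternFromFile($path){
	if (!is_readable($path)) {
		return null; // file missing or unreadable
	}
	$extensions = file($path, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
	if ($extensions === false || count($extensions) === 0) {
		return null; // nothing to match against
	}
	$quoted = array();
	foreach ($extensions as $ext) {
		// Escape dots so "co.uk" only matches a literal dot.
		$quoted[] = preg_quote(trim($ext), '/');
	}
	// A name label, a dot, a known extension, then whitespace or end of input.
	return '/(?<![-a-z0-9])[-a-z0-9]+\.(?:' . implode("|", $quoted) . ')(?=\s|$)/i';
}

// Usage with a throwaway file standing in for the real data file.
$path = tempnam(sys_get_temp_dir(), "ext");
file_put_contents($path, "com\nco.uk\nuk.com\n");
$pattern = buildPatternFromFile($path);
preg_match_all($pattern, "see Example.com and foo.co.uk today", $m);
echo implode("\n", $m[0]), "\n";
unlink($path);
```

Keeping the list in a file means new extensions can be added without touching the code.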
 