Baiduspider China

00:00:00 baiduspider 180.76.5.172 ~ China 01:18:27pm 01:18:27pm No
Last URL: /2010/page/13/
User Agent: Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)
Host: (n/a) n/a

So the above is what has been scraping my site continuously for the last 3 weeks! I’ve resorted to banning the IP range and have made several entries in my robots.txt file (although I don’t believe this bot is complying with it).

I’m not overly concerned by its presence, but the spam comment hit rate has skyrocketed since it started crawling the site. China, with its known history of human rights abuse, appears to be scraping the internet, but what is being done with all this information? (Not that I’d say there’s anything of any importance on this site!)

I admit that blocking such traffic goes against the grain of how I feel about internet freedom, but when a spider is crawling your site without disclosing a reason (and, as I said, my spam hits have increased), I’m not going to allow such traffic, as it has the potential to affect other services.

[scott@archbang-netbook ~]$ whois 180.76.5.100
% [whois.apnic.net node-5]
% Whois data copyright terms

inetnum: 180.76.0.0 - 180.76.255.255
netname: Baidu
descr: Beijing Baidu Netcom Science and Technology Co., Ltd.
descr: Baidu Plaza, No.10, Shangdi 10th street,Haidian District Beijing,100080
country: CN
admin-c: WN141-AP
tech-c: JC2179-AP
mnt-by: MAINT-CNNIC-AP
mnt-lower: MAINT-CNNIC-AP
mnt-routes: MAINT-CNNIC-AP
status: ALLOCATED PORTABLE
changed: hm-changed@apnic.net 20090715
source: APNIC

person: Nan Wang
address: Baidu Plaza, No.10, Shangdi 10th street,Haidian District Beijing,100080
country: CN
phone: +8610-59927164
fax-no: +8610-62684273
e-mail: wangnan@baidu.com
nic-hdl: WN141-AP
mnt-by: MAINT-CNNIC-AP
changed: ipas@cnnic.cn 20100322
source: APNIC

person: Jacky Chang
nic-hdl: JC2179-AP
address: 10th Floor No.6 2nd North Street Haidian District Beijing,100080
country: CN
phone: +8610-82602288-7280
fax-no: +8610-62684273
e-mail: zhangcheng@baidu.com
mnt-by: MAINT-CNNIC-AP
changed: ipas@cnnic.net.cn 20071227
source: APNIC

Now I’m no expert when it comes to APNIC requirements, but just take a look at some of the fields… I have my doubts that this is even a legit company. 😐
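
For anyone wanting to block on the same basis: the inetnum range in the whois output, 180.76.0.0 - 180.76.255.255, is the CIDR block 180.76.0.0/16. Below is a minimal sketch in POSIX shell for checking whether an address from the logs falls inside a block before banning it. The helper names ip_to_int and in_cidr are made up for illustration and are not standard tools, and the sketch assumes plain dotted-quad IPv4 addresses without leading zeros.

```shell
#!/bin/sh
# Sketch only: ip_to_int and in_cidr are hypothetical helpers, not standard tools.

# Convert a dotted-quad IPv4 address to a single integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed (exit status 0) if address $1 lies inside CIDR block $2.
in_cidr() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")     # part before the slash
    bits=${2#*/}                   # prefix length after the slash
    mask=$(( (0xFFFFFFFF << (32 - $bits)) & 0xFFFFFFFF ))
    [ $(( $ip & $mask )) -eq $(( $net & $mask )) ]
}

# Addresses taken from the post and the comments below it.
for addr in 180.76.5.172 180.76.5.100 220.181.111.86; do
    if in_cidr "$addr" 180.76.0.0/16; then
        echo "$addr is inside 180.76.0.0/16"
    else
        echo "$addr is outside 180.76.0.0/16"
    fi
done
```

A check like this only says which block an address belongs to; it says nothing about whether the traffic really comes from the organisation named in the whois record.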

[scott@archbang-netbook ~]$ dig 180.76.5.100

; <<>> DiG 9.8.1 <<>> 180.76.5.100
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NXDOMAIN, id: 49764
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 0

;; QUESTION SECTION:
;180.76.5.100. IN A

;; AUTHORITY SECTION:
. 10756 IN SOA a.root-servers.net. nstld.verisign-grs.com. 2011111001 1800 900 604800 86400

;; Query time: 63 msec
;; SERVER: 192.168.1.254#53(192.168.1.254)
;; WHEN: Fri Nov 11 13:48:49 2011
;; MSG SIZE rcvd: 105
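
One caveat on reading that result: without the -x option, dig treats 180.76.5.100 as a hostname and asks for its A record (see the QUESTION SECTION), so NXDOMAIN here does not say whether the address has reverse DNS. A reverse lookup followed by a forward lookup of the returned name is the usual way to check a crawler's claimed identity. The sketch below is a rough illustration of that idea: the helper names are hypothetical, and the assumption that genuine Baiduspider hosts have names under baidu.com or baidu.jp should be checked against Baidu's own help pages before relying on it.

```shell
#!/bin/sh
# Sketch only: both helper names below are hypothetical.
# Assumption: real Baiduspider hosts have PTR names under baidu.com or baidu.jp.

# Succeed if the hostname (with or without a trailing dot) ends in an expected domain.
looks_like_baidu_host() {
    case "${1%.}" in
        *.baidu.com|*.baidu.jp) return 0 ;;
        *) return 1 ;;
    esac
}

# Forward-confirmed reverse DNS check. Needs network access and dig.
verify_baiduspider() {
    ip=$1
    # Reverse lookup: -x asks for the PTR record of the address.
    ptr=$(dig +short -x "$ip" | head -n 1)
    if [ -z "$ptr" ]; then
        echo "$ip: no PTR record"
        return 1
    fi
    if ! looks_like_baidu_host "$ptr"; then
        echo "$ip: PTR $ptr is not under an expected Baidu domain"
        return 1
    fi
    # Forward lookup: the PTR name must resolve back to the same address.
    if dig +short "$ptr" A | grep -qxF "$ip"; then
        echo "$ip: verified as $ptr"
    else
        echo "$ip: $ptr does not resolve back to $ip"
        return 1
    fi
}

# Usage (requires network): verify_baiduspider 180.76.5.172
```

The forward step matters because whoever controls an address block can publish any PTR name they like; only the owner of the domain can make that name resolve back to the address.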

So only time will tell if I got it right...

© 2011, Scott Evans. Creative Commons License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

9 Replies to “Baiduspider China”

  1. Baidu is a Chinese search engine, much the same as Google, so I would suggest that frequent spidering from them is normal. I doubt that it’s related to the spam, unless spammers are using Baidu results to find blogs to spam…

    1. Hi Bruce, I only did a quick Google search on the name and there’s lots of info (that I didn’t bother to read!), but even with the changes I’ve made the bot is ignoring my robots.txt, and unless I messed up the format in my .htaccess it’s getting through that too 😐

  2. I have been seeing the same spider crawling my site quite a bit lately as well. And looking through my Drupal log, I can see just what pages it’s trying to hit. *None* of the pages exist. :/ I’ve also noticed that when this ‘spider’ started hitting my site, my spam comments started increasing drastically as well. Search engine or not, I’m blocking these IP ranges as well.
    It wouldn’t bother me that much (Mollom is an **excellent** spam block), but all of the pages that it’s trying to find (along with the login and register pages, which don’t have active links anywhere on my site) make me somewhat nervous.

  3. Also, as I’ve been digging into this a little more: when you run a whois on an official Baidu.com IP address (220.181.111.86, 123.125.114.144 and 220.181.111.85 are a few that the ‘host’ command pulls up), you get a completely different whois report than you do on the 180.76.5.* IPs (I’m also getting hits from 180.76.6.*). But, seeing that it’s not really a search engine bot, I doubt anything you put in robots.txt will help here. 😉

    1. Yes, it’s clearly a non-compliant spider purely for the purpose of flooding sites with spam. My spam count since this started is now well above 250. That doesn’t sound like a lot, but I did a fresh install of WordPress back in September, so my spam count was reset!

      I’ve noticed that this is still scouring my blog even after placing IP blocks for the ranges that it’s claiming to be from, so as a guess it’s most likely using Tor to hide behind.

      1. After a few days’ worth of keeping an eye on this, it seems that I’ve mostly stopped the “spider” from accessing my site. At least in the Apache access log, all the spider gets are 403s. A quick and easy command to see all of the IPs that the spider is using, along with the status codes returned:
        sudo grep "Baiduspider" /var/log/apache2/access.log | cut -d' ' -f1,9 | sort | uniq -c | sort -nr
        Also, the amount of spam that I get seems to have dropped somewhat since blocking the ip ranges. :/
        As you said, “only time will tell”. 😉

        1. Yes, likewise, so far so good, but around 250 spam comments blocked is a testament to both Akismet and WP-reCAPTCHA, which stopped them before they had a chance to flood this little blog of nothing! 😉

  4. Baidu seems to be an email harvester that also tries to steal other important data. On my site I always see it in the RSS file, where it can get email addresses. Fortunately, all the addresses in there are just dummies. I think it uses a huge IP range, like 220.* and 180.*. When I try to block these IP ranges, it goes wild and attacks the site from a different IP range. I don’t understand how it bypasses the .htaccess file.

    1. Well, just when I thought all was going well, I’m now being targeted by a new one based in North Carolina 🙁 If I were to block this IP range, it would take out valid visitors to my site!
