Advanced Selenium: Bypassing Crawler Detection


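Overview

This walkthrough bypasses crawler detection, using several of the checks on the test page https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html. It relies mainly on JavaScript. Taking the WebDriver check as an example, the key technique is: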

Object.defineProperty(navigator, 'webdriver', {
    get: () => false,
});

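Executing the JavaScript directly from the automation code has two limitations: the override is lost whenever the page refreshes, and the script cannot be run before the page loads. To work around this, the proxy tool mitmproxy is used to insert the script into every web page returned with status 200.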
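As a rough illustration of what the proxy does to each page, the sketch below places a script element immediately after the opening head tag, so it runs before any script the page itself declares. The function name `inject_script`, the constant `PATCH_JS`, and the sample page are hypothetical, and the regular expression is only a simplified stand-in for a real HTML parser.

```python
import re

# Hypothetical stand-in for the script that should run before the page's own code.
PATCH_JS = "Object.defineProperty(navigator, 'webdriver', { get: () => false });"


def inject_script(html: str, js: str) -> str:
    """Return html with a <script> element placed right after the opening <head> tag.

    If no <head> tag is found, the document is returned unchanged.
    """
    script = '<script type="text/javascript">' + js + "</script>"
    # Match only the first opening <head ...> tag, case-insensitively.
    # A function is used as the replacement so the JavaScript is inserted literally.
    return re.sub(
        r"(<head\b[^>]*>)",
        lambda m: m.group(1) + script,
        html,
        count=1,
        flags=re.IGNORECASE,
    )


# Sample page (an assumption for this sketch) with a detection script in its head.
page = "<html><head><script src='detect.js'></script></head><body></body></html>"
patched = inject_script(page, PATCH_JS)
print(patched)
```

Because the injected element is the first child of the head, the browser evaluates it before `detect.js` in this sample, which is the ordering that cannot be achieved by running the script from the automation code after the page has loaded.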

1. Install the mitmproxy proxy and the BeautifulSoup library

pip install mitmproxy bs4 lxml

Write the following two scripts in the same directory:

inject.py:

from bs4 import BeautifulSoup
from mitmproxy import ctx

# load in the javascript to inject
with open('content.js', 'r') as f:
    content_js = f.read()

def response(flow):
    # only process 200 responses of html content
    if not flow.response.headers.get('Content-Type', '').startswith('text/html'):
        return
    if not flow.response.status_code == 200:
        return

    # inject the script tag
    html = BeautifulSoup(flow.response.text, 'lxml')
    container = html.head or html.body
    if container:
        script = html.new_tag('script', type='text/javascript')
        script.string = content_js
        container.insert(0, script)
        flow.response.text = str(html)

        ctx.log.info('Successfully injected the content.js script.')
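The filter at the top of `response` decides which responses get modified. Servers commonly send a header such as `text/html; charset=utf-8`, and some responses carry no `Content-Type` at all, so the check needs to tolerate both. A minimal sketch of such a check, using only the standard library and a hypothetical helper name `should_inject`, might look like this:

```python
from typing import Optional


def should_inject(status_code: int, content_type: Optional[str]) -> bool:
    """Return True only for 200 responses whose media type is text/html."""
    if status_code != 200 or not content_type:
        return False
    # Drop parameters such as "; charset=utf-8" and compare the bare media type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"


print(should_inject(200, "text/html"))                 # True
print(should_inject(200, "text/html; charset=utf-8"))  # True
print(should_inject(200, "application/json"))          # False
print(should_inject(404, "text/html"))                 # False
print(should_inject(200, None))                        # False
```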

content.js:

// Override navigator.languages
Object.defineProperty(navigator, 'languages', {
  get: function() {
    return ['en-US', 'en'];
  },
});

// Override navigator.plugins so that its length is non-zero
Object.defineProperty(navigator, 'plugins', {
  get: function() {
    // this just needs to have `length > 0`, but we could mock the plugins too
    return [1, 2, 3, 4, 5];
  },
});

// Override navigator.webdriver
Object.defineProperty(navigator, 'webdriver', {
    get: () => false,
});

2. Run the proxy script and start the browser with remote debugging enabled

(1) Run the script

mitmdump -p 8080 -s "inject.py"

(2) Start the browser

Change to the Chrome installation directory:

 .\chrome.exe --proxy-server=localhost:8080  --remote-debugging-port=9222

(3) Download and install the certificate so that HTTPS traffic can be intercepted

http://mitm.it/
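Before running the test code, it can help to confirm that both local services are accepting connections. The sketch below is only a convenience check and not part of the original setup. It assumes the ports used above (8080 for mitmdump, 9222 for Chrome remote debugging), and the helper name `port_is_open` is hypothetical.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Ports taken from the commands above; adjust them if different values were used.
    for name, port in (("mitmdump proxy", 8080), ("Chrome remote debugging", 9222)):
        state = "listening" if port_is_open("127.0.0.1", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

A successful connection only shows that something is listening on the port. It does not confirm that the certificate is installed or that pages are actually being modified.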

Test code

package com.selenium;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.concurrent.TimeUnit;

public class Test1 {
    static {
        System.setProperty("webdriver.chrome.driver", "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chromedriver123.exe");
        // Silence the log output of the Chrome browser driver
        System.setProperty("webdriver.chrome.silentOutput", "true");
    }

    public static void main(String[] args) {
        WebDriver driver = null;
        try {
            ChromeOptions option = new ChromeOptions();
            // Disable browser extensions
            option.addArguments("--disable-extensions");
            // Attach to the already running browser through its remote debugging port
            option.setExperimentalOption("debuggerAddress", "127.0.0.1:9222");
            driver = new ChromeDriver(option);
            driver.manage().timeouts().implicitlyWait(20, TimeUnit.MILLISECONDS);
            // Open the detection test page
            driver.get("https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html");

            Thread.sleep(1000000);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (driver != null) {
                // Close the current browser window
                driver.close();
                // Quit the webdriver and close the browser
                driver.quit();
            }
        }
    }
}

Once the site opens, the checks on the test page pass.

This tutorial is based on the article https://intoli.com/blog/making-chrome-headless-undetectable/.