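Overview
This walkthrough shows how to get past crawler detection, using several of the checks on the test page https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html. It relies mainly on JavaScript. Taking the WebDriver check as an example, the core technique is: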
Object.defineProperty(navigator, 'webdriver', {
  get: () => false,
});
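Running the JavaScript directly from the driver code has two limitations: the override is lost whenever the page refreshes, and the script cannot run before the page loads. To work around this, we use the proxy tool mitmproxy to insert the JavaScript into every HTML page returned with status 200.
1. Install the mitmproxy proxy and the BeautifulSoup library
pip install mitmproxy bs4 lxml
Write the following two scripts in the same directory: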
inject.py:
from bs4 import BeautifulSoup
from mitmproxy import ctx

# Load the JavaScript to inject
with open('content.js', 'r') as f:
    content_js = f.read()


def response(flow):
    # Only process 200 responses with HTML content. The header may be missing
    # or carry parameters, e.g. "text/html; charset=utf-8".
    content_type = flow.response.headers.get('Content-Type', '')
    if not content_type.startswith('text/html'):
        return
    if flow.response.status_code != 200:
        return

    # Inject the script tag at the top of <head> (or <body> if there is no <head>)
    html = BeautifulSoup(flow.response.text, 'lxml')
    container = html.head or html.body
    if container:
        script = html.new_tag('script', type='text/javascript')
        script.string = content_js
        container.insert(0, script)
        flow.response.text = str(html)
        ctx.log.info('Successfully injected the content.js script.')
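The two decisions made in inject.py (which responses to touch, and where the script tag goes) can be tried out without running the proxy. The following is only a minimal sketch under some assumptions: the function names `should_inject` and `inject_script` are hypothetical and not part of mitmproxy, and a regular expression stands in for BeautifulSoup so that the example needs nothing beyond the standard library. It is meant to illustrate the idea, not to replace the parser-based script.

```python
import re
from typing import Optional


def should_inject(status_code: int, content_type: Optional[str]) -> bool:
    """Return True for 200 responses whose media type is text/html.

    Hypothetical helper: the header value may be missing or may carry
    parameters such as "; charset=utf-8", so only the media type is compared.
    """
    if status_code != 200 or not content_type:
        return False
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'text/html'


def inject_script(html: str, js: str) -> str:
    """Insert a <script> tag right after the opening <head> (or <body>) tag.

    Hypothetical helper: uses a regex instead of an HTML parser, so it only
    handles ordinary markup. Pages with neither tag are returned unchanged.
    """
    tag = '<script type="text/javascript">' + js + '</script>'
    for name in ('head', 'body'):
        # \b keeps "<head" from matching "<header"
        match = re.search(r'<' + name + r'\b[^>]*>', html, re.IGNORECASE)
        if match:
            return html[:match.end()] + tag + html[match.end():]
    return html
```

Placing the tag first inside `<head>` matters because the overrides have to be in place before any detection script on the page runs.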
content.js:
// Override navigator.languages
Object.defineProperty(navigator, 'languages', {
  get: function() {
    return ['en-US', 'en'];
  },
});
// Override navigator.plugins so that its length is non-zero
Object.defineProperty(navigator, 'plugins', {
  get: function() {
    // this just needs to have `length > 0`, but we could mock the plugins too
    return [1, 2, 3, 4, 5];
  },
});
// Override navigator.webdriver
Object.defineProperty(navigator, 'webdriver', {
  get: () => false,
});
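2. Run the proxy script and launch the browser with remote debugging enabled
(1) Run the script
mitmdump -p 8080 -s "inject.py"
(2) Launch the browser
From the Chrome installation directory:
.\chrome.exe --proxy-server=localhost:8080 --remote-debugging-port=9222
(3) Download and install the certificate so that HTTPS traffic can be intercepted (open this address in the proxied browser)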
http://mitm.it/
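Before attaching Selenium, it can help to confirm that both services from the steps above are actually accepting connections. The following is a rough sketch, assuming the ports used in this article (8080 for mitmdump, 9222 for Chrome remote debugging); the function names `is_listening` and `check_services` are hypothetical and exist only for this illustration.

```python
import socket
from typing import Dict


def is_listening(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable
        return False


def check_services(host: str = '127.0.0.1') -> Dict[str, bool]:
    """Check the two ports this walkthrough relies on (assumed values)."""
    return {
        'mitmdump proxy (8080)': is_listening(host, 8080),
        'Chrome remote debugging (9222)': is_listening(host, 9222),
    }
```

Calling `check_services()` returns a dictionary that maps each service to `True` or `False`. A `False` entry suggests the corresponding command has not been started or is using a different port.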
Test code
package com.selenium;

import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

import java.util.concurrent.TimeUnit;

public class Test1 {
    static {
        System.setProperty("webdriver.chrome.driver", "C:\\Program Files (x86)\\Google\\Chrome\\Application\\chromedriver123.exe");
        // Turn off log output from the Chrome driver
        System.setProperty("webdriver.chrome.silentOutput", "true");
    }

    public static void main(String[] args) {
        WebDriver driver = null;
        try {
            ChromeOptions option = new ChromeOptions();
            // Disable browser extensions
            option.addArguments("--disable-extensions");
            // Attach to the already-running browser through its remote debugging port
            option.setExperimentalOption("debuggerAddress", "127.0.0.1:9222");
            driver = new ChromeDriver(option);
            driver.manage().timeouts().implicitlyWait(20, TimeUnit.MILLISECONDS);
            // Open the detection test page
            driver.get("https://intoli.com/blog/not-possible-to-block-chrome-headless/chrome-headless-test.html");
            // Keep the page open so the results can be inspected
            Thread.sleep(1000000);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (driver != null) {
                // Close the current browser window
                driver.close();
                // Quit WebDriver and close the browser
                driver.quit();
            }
        }
    }
}
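After the site opens, the tests pass.
This tutorial is based on the article https://intoli.com/blog/making-chrome-headless-undetectable/.
This article was written by GY and is licensed under the Creative Commons Attribution 4.0 International License.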
Last edited: 2023/01/30 07:58