反爬虫对抗基础:请求伪装
当你开始爬取一些有价值的网站时,很快就会发现:不是所有网站都欢迎爬虫。反爬虫与反反爬虫的对抗,是爬虫工程师必须面对的课题。本章我们将学习最基础也最重要的反爬对抗技术——请求伪装。
学习目标:掌握 User-Agent 轮换、请求头伪装、TLS 指纹模拟和速率控制等核心技术,让你的爬虫不易被识别。
反爬虫机制概述
(图:反爬检测流程)
为什么网站要反爬虫
在开始学习反爬技术之前,我们需要理解网站为什么要反爬虫:
- 保护数据资产:数据是有价值的,网站不希望被批量获取
- 保护服务器资源:爬虫会消耗服务器带宽和计算资源
- 防止恶意行为:如价格监控、竞品分析、数据倒卖等
- 合规要求:某些数据有法律保护要求
常见的反爬虫检测手段
| 检测类型 | 检测方式 | 难度 | 常见应对 |
|---|---|---|---|
| 请求特征检测 | User-Agent、请求头完整性 | 低 | 伪装请求头 |
| 行为特征检测 | 访问频率、访问路径 | 中 | 速率控制、随机延迟 |
| API签名检测 | 参数签名验证 | 中 | 逆向签名算法 |
| Cookie检测 | 登录状态验证 | 中 | 登录获取Cookie |
| 浏览器指纹检测 | JS 环境、Canvas、WebGL | 高 | 使用真实浏览器 |
| 验证码检测 | 图片验证码、滑块验证码 | 高 | OCR、打码平台 |
本章主要聚焦于请求特征检测的对抗,这是最基础的反爬手段。
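为了更直观地理解"请求特征检测",下面给出一个极简的检查逻辑示意。函数名和规则均为假设,仅用于说明思路,真实网站的判定逻辑要复杂得多:

```python
# 一个极简的"请求特征检测"示意:只看 UA 前缀和关键请求头是否缺失
# (纯属演示,规则与阈值均为假设)
def looks_like_bot(headers: dict) -> bool:
    ua = headers.get("User-Agent", "")
    # 特征一:程序化客户端的默认 UA
    if ua.startswith(("python-requests", "python-httpx", "curl/", "Scrapy")):
        return True
    # 特征二:缺少真实浏览器几乎必带的请求头
    if "Accept" not in headers or "Accept-Language" not in headers:
        return True
    return False


print(looks_like_bot({"User-Agent": "python-requests/2.31.0"}))  # True
print(looks_like_bot({
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,*/*;q=0.8",
    "Accept-Language": "zh-CN,zh;q=0.9",
}))  # False
```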
User-Agent 策略
什么是 User-Agent
User-Agent(简称 UA)是 HTTP 请求头中的一个字段,用于标识发起请求的客户端类型。例如:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
这个字符串包含了:
- 浏览器类型(Chrome)
- 浏览器版本(120.0.0.0)
- 操作系统(Windows 10 64位)
- 渲染引擎(AppleWebKit/537.36)
为什么要轮换 User-Agent
如果你的爬虫始终使用同一个 UA,会有以下风险:
- 特征明显:requests 库默认的 UA 是 python-requests/2.x.x,一眼就能识别(见下面的小例子)
- 容易被追踪:同一 UA 的大量请求会被关联分析
- 容易被封禁:一旦被识别,可以按 UA 封禁
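下面是一个最小的验证示例(假设本机可以访问 httpbin.org,该接口会原样回显收到的 User-Agent):

```python
# 对比默认 UA 与浏览器 UA,服务端看到的内容完全不同
import requests

# 默认配置:服务端看到的是 python-requests/x.x.x
print(requests.get("https://httpbin.org/user-agent", timeout=10).json())

# 换成真实浏览器 UA 之后
browser_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
)
print(
    requests.get(
        "https://httpbin.org/user-agent",
        headers={"User-Agent": browser_ua},
        timeout=10,
    ).json()
)
```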
实现 User-Agent 轮换
方法一:手动维护 UA 列表
import random
# 桌面浏览器 UA 列表
DESKTOP_USER_AGENTS = [
# Chrome Windows
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
# Chrome Mac
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
# Firefox Windows
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
# Firefox Mac
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0",
# Safari Mac
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
# Edge Windows
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0",
]
# 移动端 UA 列表
MOBILE_USER_AGENTS = [
# iPhone Safari
"Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1",
# Android Chrome
"Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
# Android Samsung
"Mozilla/5.0 (Linux; Android 14; SM-S918B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
]
def get_random_ua(mobile: bool = False) -> str:
"""获取随机 User-Agent"""
ua_list = MOBILE_USER_AGENTS if mobile else DESKTOP_USER_AGENTS
return random.choice(ua_list)
方法二:使用 fake-useragent 库
pip install fake-useragent
from fake_useragent import UserAgent
# 创建 UserAgent 对象
ua = UserAgent()
# 获取随机 UA
print(ua.random) # 完全随机
print(ua.chrome) # 随机 Chrome UA
print(ua.firefox) # 随机 Firefox UA
print(ua.safari) # 随机 Safari UA
实现 UA 轮换器
import random
from typing import List, Optional
from fake_useragent import UserAgent
class UARotator:
"""User-Agent 轮换器"""
def __init__(self, use_fake_ua: bool = True, custom_uas: Optional[List[str]] = None):
"""
初始化 UA 轮换器
Args:
use_fake_ua: 是否使用 fake-useragent 库
custom_uas: 自定义 UA 列表
"""
self.use_fake_ua = use_fake_ua
self.custom_uas = custom_uas or []
if use_fake_ua:
try:
self._fake_ua = UserAgent()
except Exception:
self.use_fake_ua = False
def get_random(self) -> str:
"""获取随机 UA"""
# 优先使用自定义列表
if self.custom_uas:
return random.choice(self.custom_uas)
# 使用 fake-useragent
if self.use_fake_ua:
return self._fake_ua.random
# 默认返回 Chrome UA
return (
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36"
)
def get_chrome(self) -> str:
"""获取 Chrome UA"""
if self.use_fake_ua:
return self._fake_ua.chrome
return self.get_random()
def get_mobile(self) -> str:
"""获取移动端 UA"""
mobile_uas = [
"Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15",
"Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36",
]
return random.choice(mobile_uas)请求头完整伪装
为什么仅有 User-Agent 不够
很多网站不仅检测 UA,还会检测其他请求头字段。真实浏览器的请求头是非常丰富的:
GET /api/data HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Cache-Control: max-age=0
而 Python 默认的请求头非常简陋,很容易被识别。
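可以用下面的小脚本直接对比(假设本机可以访问 httpbin.org,该接口会回显服务端收到的全部请求头):

```python
# 打印 httpx 默认发送的请求头,与上面真实浏览器的请求头对比
import httpx

resp = httpx.get("https://httpbin.org/headers", timeout=10)
for name, value in resp.json()["headers"].items():
    print(f"{name}: {value}")
# 通常只有 Accept、Accept-Encoding、Host、User-Agent(python-httpx/x.x.x)等寥寥几项
```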
构建完整的请求头
from typing import Dict, Optional
class HeadersBuilder:
"""请求头构建器"""
# 基础请求头模板
BASE_HEADERS = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
"Sec-Fetch-User": "?1",
"Cache-Control": "max-age=0",
}
# API 请求头模板
API_HEADERS = {
"Accept": "application/json, text/plain, */*",
"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Sec-Fetch-Dest": "empty",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Site": "same-origin",
}
def __init__(self, ua_rotator: Optional['UARotator'] = None):
self.ua_rotator = ua_rotator or UARotator()
def build(
self,
referer: Optional[str] = None,
origin: Optional[str] = None,
extra_headers: Optional[Dict[str, str]] = None,
is_api: bool = False
) -> Dict[str, str]:
"""
构建请求头
Args:
referer: Referer 地址
origin: Origin 地址
extra_headers: 额外的请求头
is_api: 是否是 API 请求
Returns:
完整的请求头字典
"""
# 选择基础模板
headers = self.API_HEADERS.copy() if is_api else self.BASE_HEADERS.copy()
# 添加 User-Agent
headers["User-Agent"] = self.ua_rotator.get_random()
# 添加 Referer
if referer:
headers["Referer"] = referer
# 如果有 Referer,通常 Sec-Fetch-Site 应该是 same-origin 或 cross-site
headers["Sec-Fetch-Site"] = "same-origin"
# 添加 Origin(通常用于 POST 请求或 CORS 请求)
if origin:
headers["Origin"] = origin
# 合并额外请求头
if extra_headers:
headers.update(extra_headers)
return headers
def build_for_ajax(
self,
referer: str,
x_requested_with: bool = True
) -> Dict[str, str]:
"""
构建 AJAX 请求头
Args:
referer: Referer 地址
x_requested_with: 是否添加 X-Requested-With 头
Returns:
AJAX 请求头
"""
headers = self.build(referer=referer, is_api=True)
if x_requested_with:
headers["X-Requested-With"] = "XMLHttpRequest"
return headers
Referer 的正确设置
Referer 表示当前请求是从哪个页面发起的。正确设置 Referer 很重要:
async def crawl_with_referer(client, list_url: str, detail_urls: list):
"""演示正确设置 Referer"""
headers_builder = HeadersBuilder()
# 访问列表页
list_headers = headers_builder.build()
response = await client.get(list_url, headers=list_headers)
# 访问详情页时,Referer 应该是列表页
for detail_url in detail_urls:
detail_headers = headers_builder.build(referer=list_url)
response = await client.get(detail_url, headers=detail_headers)
请求头实战配置
以下是一个通用的请求头构建器,可以根据目标网站进行定制:
# headers_builder.py - 通用请求头构建器
"""通用请求头配置"""
class SiteHeadersBuilder:
"""站点请求头构建器"""
def __init__(self, base_url: str):
"""
初始化构建器
Args:
base_url: 目标网站基础URL
"""
self.base_url = base_url
self.base_headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
"AppleWebKit/537.36 (KHTML, like Gecko) "
"Chrome/120.0.0.0 Safari/537.36",
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Sec-Fetch-Dest": "document",
"Sec-Fetch-Mode": "navigate",
"Sec-Fetch-Site": "none",
}
def build_for_page(self, referer: str = "") -> dict:
"""
构建页面请求头
Args:
referer: Referer 地址
Returns:
完整的请求头字典
"""
headers = self.base_headers.copy()
if referer:
headers["Referer"] = referer
headers["Sec-Fetch-Site"] = "same-origin"
return headers
def build_for_api(self, referer: str = "") -> dict:
"""构建 API 请求头"""
headers = self.base_headers.copy()
headers["Accept"] = "application/json, text/plain, */*"
headers["Sec-Fetch-Dest"] = "empty"
headers["Sec-Fetch-Mode"] = "cors"
if referer:
headers["Referer"] = referer
headers["Sec-Fetch-Site"] = "same-origin"
return headers
请求头配置要点:
- Referer:很多网站会检查,应设置为合理的来源页面
- Sec-Fetch 系列:现代浏览器标准头,建议保持完整
- Accept:页面请求和 API 请求的 Accept 头不同
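下面给出一个简单的使用示意(URL 仅作演示,假设配合 httpx 使用):

```python
# SiteHeadersBuilder 的使用示意:先访问页面,再以该页面为 Referer 请求接口
import asyncio
import httpx

async def demo():
    builder = SiteHeadersBuilder("https://www.example.com")
    async with httpx.AsyncClient(timeout=30) as client:
        list_url = "https://www.example.com/list"
        page_resp = await client.get(list_url, headers=builder.build_for_page())
        api_resp = await client.get(
            "https://www.example.com/api/items",
            headers=builder.build_for_api(referer=list_url),
        )
        print(page_resp.status_code, api_resp.status_code)

asyncio.run(demo())
```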
使用 curl_cffi 模拟浏览器指纹
什么是 TLS 指纹
除了 HTTP 请求头,服务器还可以通过 TLS(HTTPS 握手)的特征来识别客户端。不同的客户端(浏览器、Python requests、curl 等)在 TLS 握手时会展现不同的特征:
- 支持的加密套件顺序
- 支持的 TLS 扩展
- 椭圆曲线参数
Python 的 requests 和 httpx 使用的 TLS 指纹与真实浏览器差异很大,容易被识别。
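在使用 curl_cffi 之前,可以先用 httpx 请求指纹检测接口,直观感受原生 Python 客户端的指纹(返回字段以站点实际输出为准,这里只是示意):

```python
# 查看当前客户端的 TLS 指纹信息,之后可与 curl_cffi 模拟浏览器的结果对比
import httpx

resp = httpx.get("https://tls.browserleaks.com/json", timeout=15)
data = resp.json()
# 返回中通常包含 ja3 相关字段,不同客户端的哈希值会明显不同
print({key: value for key, value in data.items() if "ja3" in key.lower()})
```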
curl_cffi 简介
curl_cffi 是一个可以模拟各种浏览器 TLS 指纹的 HTTP 客户端库。
安装:
pip install curl_cffi
基本使用
from curl_cffi import requests
# 模拟 Chrome 浏览器
response = requests.get(
"https://tls.browserleaks.com/json",
impersonate="chrome120" # 模拟 Chrome 120
)
print(response.json())
# 支持的浏览器指纹
# chrome99, chrome100, chrome101, ..., chrome120
# chrome99_android
# edge99, edge101
# safari15_3, safari15_5, safari17_0
异步使用
from curl_cffi.requests import AsyncSession
async def fetch_with_curl_cffi():
"""使用 curl_cffi 的异步请求"""
async with AsyncSession(impersonate="chrome120") as session:
response = await session.get("https://httpbin.org/headers")
print(response.json())
封装 curl_cffi 客户端
from curl_cffi.requests import AsyncSession
from typing import Optional, Dict, Any
import random
class BrowserClient:
"""
模拟浏览器的 HTTP 客户端
使用 curl_cffi 模拟真实浏览器的 TLS 指纹
"""
BROWSER_VERSIONS = [
"chrome119",
"chrome120",
"edge99",
"edge101",
"safari15_5",
"safari17_0",
]
def __init__(
self,
impersonate: Optional[str] = None,
proxy: Optional[str] = None,
timeout: int = 30
):
"""
初始化客户端
Args:
impersonate: 模拟的浏览器,如 "chrome120",None 表示随机
proxy: 代理地址
timeout: 超时时间
"""
self.impersonate = impersonate
self.proxy = proxy
self.timeout = timeout
self._session: Optional[AsyncSession] = None
async def __aenter__(self):
await self.start()
return self
async def __aexit__(self, *args):
await self.close()
async def start(self):
"""启动会话"""
browser = self.impersonate or random.choice(self.BROWSER_VERSIONS)
self._session = AsyncSession(
impersonate=browser,
proxy=self.proxy,
timeout=self.timeout
)
async def close(self):
"""关闭会话"""
if self._session:
await self._session.close()
self._session = None
async def get(
self,
url: str,
headers: Optional[Dict[str, str]] = None,
**kwargs
) -> Any:
"""发送 GET 请求"""
if not self._session:
await self.start()
return await self._session.get(url, headers=headers, **kwargs)
async def post(
self,
url: str,
data: Optional[Dict] = None,
json: Optional[Dict] = None,
headers: Optional[Dict[str, str]] = None,
**kwargs
) -> Any:
"""发送 POST 请求"""
if not self._session:
await self.start()
return await self._session.post(
url, data=data, json=json, headers=headers, **kwargs
)
速率控制
为什么需要速率控制
即使你的请求伪装得再好,如果访问频率过高,也会触发反爬机制。人类的浏览行为是有一定节奏的,而机器的请求往往过于规律或过于密集。
基本延迟策略
import asyncio
import random
async def crawl_with_delay(urls: list, min_delay: float = 1.0, max_delay: float = 3.0):
"""
带随机延迟的爬取
Args:
urls: URL 列表
min_delay: 最小延迟(秒)
max_delay: 最大延迟(秒)
"""
for url in urls:
# 发送请求
response = await fetch(url)
# 随机延迟
delay = random.uniform(min_delay, max_delay)
await asyncio.sleep(delay)
令牌桶限速器
令牌桶算法是一种更精确的限速方式,它允许一定程度的突发请求,同时保持长期平均速率。
import asyncio
import time
from typing import Optional
class TokenBucket:
"""
令牌桶限速器
工作原理:
- 桶有固定容量(最大令牌数)
- 以固定速率向桶中添加令牌
- 每次请求消耗一个令牌
- 桶空时请求需要等待
"""
def __init__(
self,
rate: float,
capacity: Optional[int] = None
):
"""
初始化令牌桶
Args:
rate: 每秒添加的令牌数(即每秒最多请求数)
capacity: 桶容量,默认等于 rate
"""
self.rate = rate
self.capacity = capacity or int(rate)
self.tokens = self.capacity
self.last_time = time.monotonic()
self._lock = asyncio.Lock()
async def acquire(self, tokens: int = 1) -> float:
"""
获取令牌
Args:
tokens: 需要的令牌数
Returns:
实际等待的时间
"""
async with self._lock:
now = time.monotonic()
# 计算从上次到现在应该添加的令牌数
elapsed = now - self.last_time
self.tokens = min(
self.capacity,
self.tokens + elapsed * self.rate
)
self.last_time = now
# 如果令牌不足,计算需要等待的时间
if self.tokens < tokens:
wait_time = (tokens - self.tokens) / self.rate
await asyncio.sleep(wait_time)
# 等待期间新生成的令牌恰好补足缺口,消耗后桶归零
# 同时把计时基准推进到等待结束的时刻,避免这段时间被重复计入
self.tokens = 0
self.last_time = time.monotonic()
return wait_time
else:
self.tokens -= tokens
return 0.0
async def __aenter__(self):
await self.acquire()
return self
async def __aexit__(self, *args):
pass
# 使用示例
async def crawl_with_rate_limit():
"""使用令牌桶限速"""
# 每秒最多 2 个请求
limiter = TokenBucket(rate=2.0)
urls = ["https://example.com/page/{}".format(i) for i in range(10)]
for url in urls:
async with limiter: # 自动限速
response = await fetch(url)
print(f"Fetched: {url}")使用 asyncio.Semaphore 控制并发
import asyncio
class ConcurrencyLimiter:
"""并发限制器"""
def __init__(self, max_concurrent: int = 10):
"""
初始化并发限制器
Args:
max_concurrent: 最大并发数
"""
self.semaphore = asyncio.Semaphore(max_concurrent)
async def run_with_limit(self, coro):
"""在并发限制下执行协程"""
async with self.semaphore:
return await coro
async def crawl_concurrent(urls: list, max_concurrent: int = 5):
"""
带并发限制的批量爬取
Args:
urls: URL 列表
max_concurrent: 最大并发数
"""
semaphore = asyncio.Semaphore(max_concurrent)
async def fetch_with_limit(url):
async with semaphore:
return await fetch(url)
# 并发执行,但同时最多 max_concurrent 个请求
tasks = [fetch_with_limit(url) for url in urls]
results = await asyncio.gather(*tasks, return_exceptions=True)
return results
HTTP 错误处理
HTTP 状态码处理
在爬取网站时,需要处理各种 HTTP 状态码:
# error_handler.py - HTTP错误处理
from loguru import logger
async def handle_response(response, url: str):
"""
处理 HTTP 响应
Args:
response: HTTP 响应对象
url: 请求的 URL
"""
status_code = response.status_code
if status_code == 200:
logger.info(f"请求成功: {url}")
return response
elif status_code == 401:
logger.warning(f"需要认证: {url}")
raise Exception("需要登录或认证凭证已过期")
elif status_code == 403:
logger.error(f"禁止访问: {url}")
raise Exception("访问被禁止,请检查请求头配置")
elif status_code == 429:
logger.warning(f"触发频率限制: {url}")
raise Exception("请求过于频繁,请降低访问速率")
elif status_code == 404:
logger.warning(f"资源不存在: {url}")
return None
else:
logger.error(f"HTTP错误 {status_code}: {url}")
raise Exception(f"HTTP错误: {status_code}")实战案例:完整的请求伪装爬虫
让我们把本章学到的技术整合成一个完整的通用爬虫示例:
# -*- coding: utf-8 -*-
"""
完整的请求伪装爬虫示例
结合 UA 轮换、请求头伪装、速率控制
"""
import asyncio
import random
from typing import List, Dict, Optional
from loguru import logger
# 如果安装了 curl_cffi,优先使用
try:
from curl_cffi.requests import AsyncSession
USE_CURL_CFFI = True
except ImportError:
import httpx
USE_CURL_CFFI = False
class AntiDetectionCrawler:
"""反检测爬虫"""
DESKTOP_UAS = [
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
"Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
]
def __init__(
self,
max_concurrent: int = 5,
min_delay: float = 1.0,
max_delay: float = 3.0
):
self.max_concurrent = max_concurrent
self.min_delay = min_delay
self.max_delay = max_delay
self.semaphore = asyncio.Semaphore(max_concurrent)
self._session = None
async def __aenter__(self):
await self.start()
return self
async def __aexit__(self, *args):
await self.close()
async def start(self):
"""启动客户端"""
if USE_CURL_CFFI:
self._session = AsyncSession(impersonate="chrome120")
else:
self._session = httpx.AsyncClient(timeout=30)
logger.info(f"客户端启动 (curl_cffi: {USE_CURL_CFFI})")
async def close(self):
"""关闭客户端"""
if self._session:
await self._session.close() if USE_CURL_CFFI else await self._session.aclose()
def _build_headers(self, referer: Optional[str] = None) -> Dict[str, str]:
"""构建请求头"""
headers = {
"User-Agent": random.choice(self.DESKTOP_UAS),
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
"Accept-Encoding": "gzip, deflate, br",
"Connection": "keep-alive",
"Upgrade-Insecure-Requests": "1",
}
if referer:
headers["Referer"] = referer
return headers
async def fetch(
self,
url: str,
referer: Optional[str] = None
) -> Optional[str]:
"""
获取页面内容
Args:
url: 目标 URL
referer: Referer 地址
Returns:
页面内容或 None
"""
async with self.semaphore:
try:
headers = self._build_headers(referer)
logger.debug(f"请求: {url}")
response = await self._session.get(url, headers=headers)
# curl_cffi 与 httpx 的响应对象都提供 raise_for_status() 和 text
response.raise_for_status()
content = response.text
logger.info(f"成功: {url}")
# 随机延迟
delay = random.uniform(self.min_delay, self.max_delay)
await asyncio.sleep(delay)
return content
except Exception as e:
logger.error(f"失败: {url} - {e}")
return None
async def crawl_batch(
self,
urls: List[str],
referer: Optional[str] = None
) -> List[Optional[str]]:
"""
批量爬取
Args:
urls: URL 列表
referer: Referer 地址
Returns:
内容列表
"""
tasks = [self.fetch(url, referer) for url in urls]
return await asyncio.gather(*tasks)
async def main():
"""主函数"""
logger.remove()
logger.add(lambda m: print(m, end=""), level="DEBUG")
urls = [
"https://httpbin.org/headers",
"https://httpbin.org/user-agent",
"https://httpbin.org/ip",
]
async with AntiDetectionCrawler(max_concurrent=2, min_delay=1, max_delay=2) as crawler:
results = await crawler.crawl_batch(urls)
for url, content in zip(urls, results):
if content:
print(f"\n{'='*50}")
print(f"URL: {url}")
print(content[:500])
if __name__ == "__main__":
asyncio.run(main())
本章小结
本章我们学习了反爬虫对抗的基础技术——请求伪装:
- User-Agent 轮换:使用真实浏览器 UA,随机轮换避免被追踪
- 请求头完整伪装:构建与真实浏览器一致的完整请求头
- TLS 指纹模拟:使用 curl_cffi 模拟浏览器的 TLS 指纹
- 速率控制:使用随机延迟和令牌桶算法控制请求频率
- HTTP 错误处理:正确处理各种 HTTP 状态码
这些技术可以应对大部分基于请求特征的反爬检测。
下一章预告
下一章我们将学习「代理 IP 的使用与管理」。主要内容包括:
- 代理 IP 的类型和选择
- 代理池的设计与实现
- 代理的有效性检测和淘汰机制
- 代理与爬虫的集成
代理 IP 是突破 IP 封禁的重要手段,也是大规模爬虫必不可少的基础设施。