Anti-Scraping Basics: Request Disguise

Once you start scraping sites with valuable data, you quickly discover that not every site welcomes crawlers. The back-and-forth between anti-scraping defenses and countermeasures is something every scraping engineer has to deal with. In this chapter we cover the most basic and most important countermeasure: request disguise.

Learning objectives: master User-Agent rotation, request-header spoofing, TLS fingerprint impersonation, and rate control, so that your crawler is harder to identify.

Overview of Anti-Scraping Mechanisms

(Figure: anti-scraping detection flow)

Why Websites Block Scrapers

Before diving into countermeasures, it helps to understand why websites fight scrapers in the first place:

  1. Protecting data assets: data has value, and sites do not want it harvested in bulk
  2. Protecting server resources: crawlers consume bandwidth and compute capacity
  3. Preventing abuse: price monitoring, competitor analysis, data reselling, and so on
  4. Compliance: some data is protected by legal requirements

Common Anti-Scraping Detection Techniques

Detection type | What it checks | Common countermeasure
Request-feature detection | User-Agent, header completeness | Spoof the request headers
Behavioral detection | Access frequency, access paths | Rate limiting, random delays
API signature checks | Parameter signature verification | Reverse-engineer the signing algorithm
Cookie checks | Login-state verification | Log in to obtain cookies
Browser fingerprinting | JS environment, Canvas, WebGL | Use a real browser
CAPTCHA challenges | Image CAPTCHAs, slider CAPTCHAs | OCR, CAPTCHA-solving services

This chapter focuses on defeating request-feature detection, the most basic anti-scraping technique.


User-Agent Strategy

What Is a User-Agent

User-Agent (UA for short) is an HTTP request-header field that identifies the type of client making the request. For example:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36

This string encodes:

  • Browser type (Chrome)
  • Browser version (120.0.0.0)
  • Operating system (64-bit Windows 10)
  • Rendering engine (AppleWebKit/537.36)

Why Rotate the User-Agent

If your crawler always sends the same UA, you run the following risks:

  1. Obvious signature: the default UA of Python's requests library is python-requests/2.x.x, recognizable at a glance
  2. Easy to track: a large volume of requests sharing one UA is easy to correlate and analyze
  3. Easy to block: once identified, the site can simply ban that UA

Implementing User-Agent Rotation

Method 1: Maintain a UA list by hand

python
import random

# Desktop browser UA list
DESKTOP_USER_AGENTS = [
    # Chrome Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    # Chrome Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    # Firefox Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    # Firefox Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) Gecko/20100101 Firefox/121.0",
    # Safari Mac
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Safari/605.1.15",
    # Edge Windows
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0",
]

# Mobile UA list
MOBILE_USER_AGENTS = [
    # iPhone Safari
    "Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.2 Mobile/15E148 Safari/604.1",
    # Android Chrome
    "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
    # Android Samsung
    "Mozilla/5.0 (Linux; Android 14; SM-S918B) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Mobile Safari/537.36",
]


def get_random_ua(mobile: bool = False) -> str:
    """获取随机 User-Agent"""
    ua_list = MOBILE_USER_AGENTS if mobile else DESKTOP_USER_AGENTS
    return random.choice(ua_list)
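
A quick usage sketch: feed the helper into the User-Agent header of whichever HTTP client you use (httpx here; the echo endpoint simply reports the UA it received):

python
import httpx

# Each call draws a different UA from the pools above
headers = {"User-Agent": get_random_ua()}
response = httpx.get("https://httpbin.org/user-agent", headers=headers)
print(response.json())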

Method 2: Use the fake-useragent library

bash
pip install fake-useragent
python
from fake_useragent import UserAgent

# Create a UserAgent object
ua = UserAgent()

# Get random UAs
print(ua.random)     # fully random
print(ua.chrome)     # a random Chrome UA
print(ua.firefox)    # a random Firefox UA
print(ua.safari)     # a random Safari UA

Building a UA Rotator

python
import random
from typing import List, Optional
from fake_useragent import UserAgent


class UARotator:
    """User-Agent 轮换器"""

    def __init__(self, use_fake_ua: bool = True, custom_uas: Optional[List[str]] = None):
        """
        Initialize the UA rotator

        Args:
            use_fake_ua: whether to use the fake-useragent library
            custom_uas: a custom list of UA strings
        """
        self.use_fake_ua = use_fake_ua
        self.custom_uas = custom_uas or []

        if use_fake_ua:
            try:
                self._fake_ua = UserAgent()
            except Exception:
                self.use_fake_ua = False

    def get_random(self) -> str:
        """获取随机 UA"""
        # 优先使用自定义列表
        if self.custom_uas:
            return random.choice(self.custom_uas)

        # 使用 fake-useragent
        if self.use_fake_ua:
            return self._fake_ua.random

        # 默认返回 Chrome UA
        return (
            "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
            "AppleWebKit/537.36 (KHTML, like Gecko) "
            "Chrome/120.0.0.0 Safari/537.36"
        )

    def get_chrome(self) -> str:
        """获取 Chrome UA"""
        if self.use_fake_ua:
            return self._fake_ua.chrome
        return self.get_random()

    def get_mobile(self) -> str:
        """获取移动端 UA"""
        mobile_uas = [
            "Mozilla/5.0 (iPhone; CPU iPhone OS 17_2 like Mac OS X) AppleWebKit/605.1.15",
            "Mozilla/5.0 (Linux; Android 14; Pixel 8) AppleWebKit/537.36",
        ]
        return random.choice(mobile_uas)
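
A brief usage sketch of the rotator, assuming the hand-maintained DESKTOP_USER_AGENTS list from method 1 is available in the same module:

python
rotator = UARotator(custom_uas=DESKTOP_USER_AGENTS)  # the custom list takes priority
print(rotator.get_random())

fallback = UARotator()  # no custom list: fake-useragent, or a fixed Chrome UA if it fails
print(fallback.get_chrome())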

Complete Request-Header Disguise

Why the User-Agent Alone Is Not Enough

Many sites check more than the UA; they also inspect other request-header fields. A real browser sends a rich set of headers:

http
GET /api/data HTTP/1.1
Host: www.example.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8
Accept-Encoding: gzip, deflate, br
Connection: keep-alive
Upgrade-Insecure-Requests: 1
Sec-Fetch-Dest: document
Sec-Fetch-Mode: navigate
Sec-Fetch-Site: none
Sec-Fetch-User: ?1
Cache-Control: max-age=0

Python HTTP clients, by contrast, send very sparse default headers, which makes them easy to identify.
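
To see this for yourself, a quick check is to request an echo endpoint such as httpbin.org/headers and inspect what the server actually received (a sketch using httpx; requests behaves much the same way):

python
import httpx

# With no headers set, only a handful of defaults are sent,
# including a User-Agent like "python-httpx/0.x" - an obvious giveaway.
response = httpx.get("https://httpbin.org/headers")
print(response.json()["headers"])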

Building Complete Request Headers

python
from typing import Dict, Optional


class HeadersBuilder:
    """请求头构建器"""

    # Base header template for page navigation
    BASE_HEADERS = {
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Upgrade-Insecure-Requests": "1",
        "Sec-Fetch-Dest": "document",
        "Sec-Fetch-Mode": "navigate",
        "Sec-Fetch-Site": "none",
        "Sec-Fetch-User": "?1",
        "Cache-Control": "max-age=0",
    }

    # Header template for API (XHR/fetch) requests
    API_HEADERS = {
        "Accept": "application/json, text/plain, */*",
        "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
        "Accept-Encoding": "gzip, deflate, br",
        "Connection": "keep-alive",
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
    }

    def __init__(self, ua_rotator: Optional['UARotator'] = None):
        self.ua_rotator = ua_rotator or UARotator()

    def build(
        self,
        referer: Optional[str] = None,
        origin: Optional[str] = None,
        extra_headers: Optional[Dict[str, str]] = None,
        is_api: bool = False
    ) -> Dict[str, str]:
        """
        Build request headers

        Args:
            referer: Referer value
            origin: Origin value
            extra_headers: extra headers to merge in
            is_api: whether this is an API request

        Returns:
            a complete headers dict
        """
        # Pick the base template
        headers = self.API_HEADERS.copy() if is_api else self.BASE_HEADERS.copy()

        # Add a User-Agent
        headers["User-Agent"] = self.ua_rotator.get_random()

        # Add a Referer
        if referer:
            headers["Referer"] = referer
            # With a Referer present, Sec-Fetch-Site is usually same-origin or cross-site
            headers["Sec-Fetch-Site"] = "same-origin"

        # Add an Origin (usually for POST or CORS requests)
        if origin:
            headers["Origin"] = origin

        # Merge any extra headers
        if extra_headers:
            headers.update(extra_headers)

        return headers

    def build_for_ajax(
        self,
        referer: str,
        x_requested_with: bool = True
    ) -> Dict[str, str]:
        """
        Build AJAX request headers

        Args:
            referer: Referer value
            x_requested_with: whether to add the X-Requested-With header

        Returns:
            AJAX request headers
        """
        headers = self.build(referer=referer, is_api=True)

        if x_requested_with:
            headers["X-Requested-With"] = "XMLHttpRequest"

        return headers

Setting Referer Correctly

Referer indicates the page from which the current request was initiated. Setting it correctly matters:

python
async def crawl_with_referer(client, list_url: str, detail_urls: list):
    """演示正确设置 Referer"""
    headers_builder = HeadersBuilder()

    # Visit the list page
    list_headers = headers_builder.build()
    response = await client.get(list_url, headers=list_headers)

    # When visiting detail pages, the Referer should be the list page
    for detail_url in detail_urls:
        detail_headers = headers_builder.build(referer=list_url)
        response = await client.get(detail_url, headers=detail_headers)

Practical Header Configuration

Below is a general-purpose header builder that can be customized for the target site:

python
# headers_builder.py - generic request-header builder
"""Generic request-header configuration"""

class SiteHeadersBuilder:
    """站点请求头构建器"""

    def __init__(self, base_url: str):
        """
        Initialize the builder

        Args:
            base_url: base URL of the target site
        """
        self.base_url = base_url
        self.base_headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                          "AppleWebKit/537.36 (KHTML, like Gecko) "
                          "Chrome/120.0.0.0 Safari/537.36",
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
        }

    def build_for_page(self, referer: str = "") -> dict:
        """
        Build headers for a page request

        Args:
            referer: Referer value

        Returns:
            a complete headers dict
        """
        headers = self.base_headers.copy()
        if referer:
            headers["Referer"] = referer
            headers["Sec-Fetch-Site"] = "same-origin"
        return headers

    def build_for_api(self, referer: str = "") -> dict:
        """构建 API 请求头"""
        headers = self.base_headers.copy()
        headers["Accept"] = "application/json, text/plain, */*"
        headers["Sec-Fetch-Dest"] = "empty"
        headers["Sec-Fetch-Mode"] = "cors"
        if referer:
            headers["Referer"] = referer
            headers["Sec-Fetch-Site"] = "same-origin"
        return headers

Key points for header configuration

  1. Referer: many sites check it, so set it to a plausible source page
  2. The Sec-Fetch-* family: standard headers sent by modern browsers; keep the full set
  3. Accept: page requests and API requests use different Accept values

Simulating Browser Fingerprints with curl_cffi

What Is a TLS Fingerprint

Besides HTTP request headers, a server can also identify a client from the characteristics of the TLS (HTTPS) handshake. Different clients (browsers, Python requests, curl, and so on) exhibit different handshake traits:

  • the order of supported cipher suites
  • the supported TLS extensions
  • the elliptic-curve parameters

The TLS fingerprints of Python's requests and httpx differ markedly from those of real browsers, which makes them easy to spot.

Introducing curl_cffi

curl_cffi is an HTTP client library that can impersonate the TLS fingerprints of various browsers.

Installation:

bash
pip install curl_cffi

Basic Usage

python
from curl_cffi import requests

# Impersonate a Chrome browser
response = requests.get(
    "https://tls.browserleaks.com/json",
    impersonate="chrome120"  # impersonate Chrome 120
)
print(response.json())

# Supported browser fingerprint targets:
# chrome99, chrome100, chrome101, ..., chrome120
# chrome99_android
# edge99, edge101
# safari15_3, safari15_5, safari17_0

Async Usage

python
from curl_cffi.requests import AsyncSession

async def fetch_with_curl_cffi():
    """使用 curl_cffi 的异步请求"""
    async with AsyncSession(impersonate="chrome120") as session:
        response = await session.get("https://httpbin.org/headers")
        print(response.json())

Wrapping curl_cffi in a Client

python
from curl_cffi.requests import AsyncSession
from typing import Optional, Dict, Any
import random


class BrowserClient:
    """
    Browser-impersonating HTTP client

    Uses curl_cffi to mimic the TLS fingerprint of a real browser
    """

    BROWSER_VERSIONS = [
        "chrome119",
        "chrome120",
        "edge99",
        "edge101",
        "safari15_5",
        "safari17_0",
    ]

    def __init__(
        self,
        impersonate: Optional[str] = None,
        proxy: Optional[str] = None,
        timeout: int = 30
    ):
        """
        Initialize the client

        Args:
            impersonate: browser to impersonate, e.g. "chrome120"; None means random
            proxy: proxy address
            timeout: timeout in seconds
        """
        self.impersonate = impersonate
        self.proxy = proxy
        self.timeout = timeout
        self._session: Optional[AsyncSession] = None

    async def __aenter__(self):
        await self.start()
        return self

    async def __aexit__(self, *args):
        await self.close()

    async def start(self):
        """启动会话"""
        browser = self.impersonate or random.choice(self.BROWSER_VERSIONS)
        self._session = AsyncSession(
            impersonate=browser,
            proxy=self.proxy,
            timeout=self.timeout
        )

    async def close(self):
        """关闭会话"""
        if self._session:
            await self._session.close()
            self._session = None

    async def get(
        self,
        url: str,
        headers: Optional[Dict[str, str]] = None,
        **kwargs
    ) -> Any:
        """发送 GET 请求"""
        if not self._session:
            await self.start()
        return await self._session.get(url, headers=headers, **kwargs)

    async def post(
        self,
        url: str,
        data: Optional[Dict] = None,
        json: Optional[Dict] = None,
        headers: Optional[Dict[str, str]] = None,
        **kwargs
    ) -> Any:
        """发送 POST 请求"""
        if not self._session:
            await self.start()
        return await self._session.post(
            url, data=data, json=json, headers=headers, **kwargs
        )
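
A minimal usage sketch of this wrapper, reusing the fingerprint-echo endpoint from earlier (the printed JSON should reflect the impersonated browser's TLS profile):

python
import asyncio

async def demo():
    # Picks a random entry from BROWSER_VERSIONS unless impersonate= is given
    async with BrowserClient(impersonate="chrome120") as client:
        response = await client.get("https://tls.browserleaks.com/json")
        print(response.json())

asyncio.run(demo())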

Rate Control

Why Rate Control Is Needed

However well disguised your requests are, an excessive request rate will still trip anti-scraping defenses. Human browsing has a natural rhythm, while machine traffic tends to be too regular or too dense.

Basic Delay Strategy

python
import asyncio
import random


async def crawl_with_delay(urls: list, min_delay: float = 1.0, max_delay: float = 3.0):
    """
    Crawl with a random delay between requests

    Args:
        urls: list of URLs
        min_delay: minimum delay in seconds
        max_delay: maximum delay in seconds
    """
    for url in urls:
        # Send the request (fetch() stands for your own request coroutine)
        response = await fetch(url)

        # Sleep for a random delay
        delay = random.uniform(min_delay, max_delay)
        await asyncio.sleep(delay)

Token-Bucket Rate Limiter

The token-bucket algorithm is a more precise way to limit rate: it allows a limited burst of requests while keeping the long-term average rate under control.

python
import asyncio
import time
from typing import Optional


class TokenBucket:
    """
    Token-bucket rate limiter

    How it works:
    - The bucket has a fixed capacity (maximum number of tokens)
    - Tokens are added to the bucket at a fixed rate
    - Each request consumes one token
    - When the bucket is empty, requests must wait
    """

    def __init__(
        self,
        rate: float,
        capacity: Optional[int] = None
    ):
        """
        Initialize the token bucket

        Args:
            rate: tokens added per second (i.e. max requests per second)
            capacity: bucket capacity, defaults to rate
        """
        self.rate = rate
        self.capacity = capacity or max(1, int(rate))  # at least 1 so fractional rates still work
        self.tokens = self.capacity
        self.last_time = time.monotonic()
        self._lock = asyncio.Lock()

    async def acquire(self, tokens: int = 1) -> float:
        """
        Acquire tokens

        Args:
            tokens: number of tokens needed

        Returns:
            the time actually waited, in seconds
        """
        async with self._lock:
            now = time.monotonic()

            # Refill: tokens accumulated since the last call
            elapsed = now - self.last_time
            self.tokens = min(
                self.capacity,
                self.tokens + elapsed * self.rate
            )
            self.last_time = now

            # Not enough tokens: wait until enough have been refilled
            if self.tokens < tokens:
                wait_time = (tokens - self.tokens) / self.rate
                await asyncio.sleep(wait_time)
                # Tokens refilled during the sleep are consumed by this request
                self.tokens = 0
                self.last_time = time.monotonic()
                return wait_time
            else:
                self.tokens -= tokens
                return 0

    async def __aenter__(self):
        await self.acquire()
        return self

    async def __aexit__(self, *args):
        pass


# Usage example
async def crawl_with_rate_limit():
    """Rate-limit crawling with the token bucket"""
    # At most 2 requests per second
    limiter = TokenBucket(rate=2.0)

    urls = ["https://example.com/page/{}".format(i) for i in range(10)]

    for url in urls:
        async with limiter:  # waits automatically when the bucket is empty
            response = await fetch(url)  # fetch() stands for your own request coroutine
            print(f"Fetched: {url}")

Controlling Concurrency with asyncio.Semaphore

python
import asyncio


class ConcurrencyLimiter:
    """并发限制器"""

    def __init__(self, max_concurrent: int = 10):
        """
        Initialize the concurrency limiter

        Args:
            max_concurrent: maximum number of concurrent tasks
        """
        self.semaphore = asyncio.Semaphore(max_concurrent)

    async def run_with_limit(self, coro):
        """在并发限制下执行协程"""
        async with self.semaphore:
            return await coro


async def crawl_concurrent(urls: list, max_concurrent: int = 5):
    """
    Batch crawl with a concurrency limit

    Args:
        urls: list of URLs
        max_concurrent: maximum number of concurrent requests
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    async def fetch_with_limit(url):
        async with semaphore:
            return await fetch(url)

    # Run concurrently, but with at most max_concurrent requests in flight at once
    tasks = [fetch_with_limit(url) for url in urls]
    results = await asyncio.gather(*tasks, return_exceptions=True)

    return results
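
The standalone ConcurrencyLimiter above is interchangeable with the inline semaphore; a sketch of the same batch crawl expressed through it (fetch() again stands in for your own request coroutine):

python
async def crawl_with_limiter(urls: list, max_concurrent: int = 5):
    """Same idea as crawl_concurrent, expressed via ConcurrencyLimiter"""
    limiter = ConcurrencyLimiter(max_concurrent=max_concurrent)
    tasks = [limiter.run_with_limit(fetch(url)) for url in urls]
    return await asyncio.gather(*tasks, return_exceptions=True)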

HTTP Error Handling

Handling HTTP Status Codes

When crawling a site you need to handle a variety of HTTP status codes:

python
# error_handler.py - HTTP error handling
from loguru import logger


async def handle_response(response, url: str):
    """
    Handle an HTTP response

    Args:
        response: the HTTP response object
        url: the requested URL
    """
    status_code = response.status_code

    if status_code == 200:
        logger.info(f"Request OK: {url}")
        return response

    elif status_code == 401:
        logger.warning(f"Authentication required: {url}")
        raise Exception("Login required or credentials have expired")

    elif status_code == 403:
        logger.error(f"Forbidden: {url}")
        raise Exception("Access forbidden - check your header configuration")

    elif status_code == 429:
        logger.warning(f"Rate limited: {url}")
        raise Exception("Too many requests - slow down the crawl rate")

    elif status_code == 404:
        logger.warning(f"Not found: {url}")
        return None

    else:
        logger.error(f"HTTP error {status_code}: {url}")
        raise Exception(f"HTTP error: {status_code}")
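
In practice, 429 and transient 5xx responses are usually worth retrying after backing off, rather than failing outright. A minimal sketch of such a wrapper, assuming client is an async HTTP client and reusing handle_response() from above:

python
import asyncio
import random

async def fetch_with_retry(client, url: str, max_retries: int = 3):
    """Retry on 429/5xx with exponential backoff plus jitter (a sketch)"""
    for attempt in range(max_retries + 1):
        response = await client.get(url)
        if response.status_code in (429, 500, 502, 503) and attempt < max_retries:
            # Back off: 2s, 4s, 8s ... plus a little random jitter
            wait = 2 ** (attempt + 1) + random.uniform(0, 1)
            await asyncio.sleep(wait)
            continue
        return await handle_response(response, url)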

Putting It Together: A Complete Request-Disguise Crawler

Let's combine the techniques from this chapter into a complete, general-purpose crawler example:

python
# -*- coding: utf-8 -*-
"""
A complete request-disguise crawler example
combining UA rotation, header spoofing, and rate control
"""

import asyncio
import random
from typing import List, Dict, Optional
from loguru import logger

# Prefer curl_cffi when it is installed
try:
    from curl_cffi.requests import AsyncSession
    USE_CURL_CFFI = True
except ImportError:
    import httpx
    USE_CURL_CFFI = False


class AntiDetectionCrawler:
    """反检测爬虫"""

    DESKTOP_UAS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) Gecko/20100101 Firefox/121.0",
    ]

    def __init__(
        self,
        max_concurrent: int = 5,
        min_delay: float = 1.0,
        max_delay: float = 3.0
    ):
        self.max_concurrent = max_concurrent
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self._session = None

    async def __aenter__(self):
        await self.start()
        return self

    async def __aexit__(self, *args):
        await self.close()

    async def start(self):
        """启动客户端"""
        if USE_CURL_CFFI:
            self._session = AsyncSession(impersonate="chrome120")
        else:
            self._session = httpx.AsyncClient(timeout=30)
        logger.info(f"客户端启动 (curl_cffi: {USE_CURL_CFFI})")

    async def close(self):
        """关闭客户端"""
        if self._session:
            await self._session.close() if USE_CURL_CFFI else await self._session.aclose()

    def _build_headers(self, referer: Optional[str] = None) -> Dict[str, str]:
        """构建请求头"""
        headers = {
            "User-Agent": random.choice(self.DESKTOP_UAS),
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
            "Accept-Language": "zh-CN,zh;q=0.9,en;q=0.8",
            "Accept-Encoding": "gzip, deflate, br",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
        }
        if referer:
            headers["Referer"] = referer
        return headers

    async def fetch(
        self,
        url: str,
        referer: Optional[str] = None
    ) -> Optional[str]:
        """
        Fetch page content

        Args:
            url: target URL
            referer: Referer value

        Returns:
            the page content, or None on failure
        """
        async with self.semaphore:
            try:
                headers = self._build_headers(referer)

                logger.debug(f"请求: {url}")
                response = await self._session.get(url, headers=headers)

                if USE_CURL_CFFI:
                    response.raise_for_status()
                    content = response.text
                else:
                    response.raise_for_status()
                    content = response.text

                logger.info(f"成功: {url}")

                # 随机延迟
                delay = random.uniform(self.min_delay, self.max_delay)
                await asyncio.sleep(delay)

                return content

            except Exception as e:
                logger.error(f"失败: {url} - {e}")
                return None

    async def crawl_batch(
        self,
        urls: List[str],
        referer: Optional[str] = None
    ) -> List[Optional[str]]:
        """
        Crawl a batch of URLs

        Args:
            urls: list of URLs
            referer: Referer value

        Returns:
            list of page contents
        """
        tasks = [self.fetch(url, referer) for url in urls]
        return await asyncio.gather(*tasks)


async def main():
    """主函数"""
    logger.remove()
    logger.add(lambda m: print(m, end=""), level="DEBUG")

    urls = [
        "https://httpbin.org/headers",
        "https://httpbin.org/user-agent",
        "https://httpbin.org/ip",
    ]

    async with AntiDetectionCrawler(max_concurrent=2, min_delay=1, max_delay=2) as crawler:
        results = await crawler.crawl_batch(urls)
        for url, content in zip(urls, results):
            if content:
                print(f"\n{'='*50}")
                print(f"URL: {url}")
                print(content[:500])


if __name__ == "__main__":
    asyncio.run(main())

Chapter Summary

In this chapter we covered the basics of anti-scraping countermeasures: request disguise.

  1. User-Agent rotation: use real browser UAs and rotate them randomly to avoid being tracked
  2. Complete header spoofing: build request headers that match what a real browser sends
  3. TLS fingerprint impersonation: use curl_cffi to mimic a browser's TLS fingerprint
  4. Rate control: use random delays and the token-bucket algorithm to pace requests
  5. HTTP error handling: respond correctly to the various HTTP status codes

These techniques are enough to get past most detection based on request features.


Next Chapter Preview

In the next chapter we will look at using and managing proxy IPs. Topics include:

  • Types of proxy IPs and how to choose them
  • Designing and implementing a proxy pool
  • Health checking and evicting dead proxies
  • Integrating proxies with the crawler

Proxy IPs are the key tool for getting around IP bans and an essential piece of infrastructure for large-scale crawling.