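Tonight I spent an hour in a battle of wits with Cursor and finally got it to scrape the 《数据库内核月报》 (Database Kernel Monthly). Compared with the CSDN scrape, the process was much bumpier and covered more ground, and it showed even more clearly how important prompts, requirement decomposition, and analysis are.
The process was long, so for now I am simply recording the chat; I will tidy it up when I have time.
I. Initial requirement
You are a Python crawler expert. Generate Python code that collects all article links from http://mysql.taobao.org/monthly/ and saves each document's content in markdown format under a local taobao directory.
Other notes:
1. Identify how the page is rendered, to avoid scraping the wrong links
2. To improve fetch efficiency, consider multithreading and other techniques
3. Avoid being blocked by anti-crawling measures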
Suggested code:

    import os
    import re
    import time
    import random
    import requests
    from bs4 import BeautifulSoup
    from concurrent.futures import ThreadPoolExecutor, as_completed

    BASE_URL = "http://mysql.taobao.org/monthly/"
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    MAX_RETRIES = 3
    THREADS = 8
    os.makedirs(SAVE_DIR, exist_ok=True)

    def get_with_retry(url, retries=MAX_RETRIES, **kwargs):
        for i in range(retries):
            try:
                resp = requests.get(url, headers=HEADERS, timeout=10, **kwargs)
                if resp.status_code == 200:
                    return resp
            except Exception:
                if i == retries - 1:
                    raise
            time.sleep(random.uniform(1, 3))
        return None

    def get_all_links():
        resp = get_with_retry(BASE_URL)
        soup = BeautifulSoup(resp.text, "lxml")
        links = []
        for a in soup.select("a"):
            href = a.get("href", "")
            # Article links usually look like 202x-xx.html
            if re.match(r"\d{4}-\d{2}\.html", href):
                links.append(BASE_URL + href)
        return links

    def save_article(url):
        try:
            resp = get_with_retry(url)
            soup = BeautifulSoup(resp.text, "lxml")
            # Article title
            title = soup.title.text.strip() if soup.title else url.split("/")[-1]
            # Body content
            content_div = soup.find("div", class_="content")
            if not content_div:
                content_div = soup.find("div", id="content")
            content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
            # Build the markdown content
            md_content = f"# {title}\n\n原文链接: [{url}]({url})\n\n---\n\n{content}"
            # File name
            filename = os.path.join(SAVE_DIR, url.split("/")[-1].replace(".html", ".md"))
            with open(filename, "w", encoding="utf-8") as f:
                f.write(md_content)
            print(f"已保存: {filename}")
            # Random delay to avoid being blocked
            time.sleep(random.uniform(0.5, 1.5))
        except Exception as e:
            print(f"下载失败: {url},原因: {e}")

    def main():
        links = get_all_links()
        print(f"共发现{len(links)}篇文章,开始下载...")
        with ThreadPoolExecutor(max_workers=THREADS) as executor:
            futures = [executor.submit(save_article, url) for url in links]
            for future in as_completed(futures):
                pass
        print("全部下载完成。")

    if __name__ == "__main__":
        main()
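One thing worth noting about the retry helper in the code above: when every attempt returns a non-200 status it returns None, and the callers then read `resp.text` on it. A minimal sketch of a stricter retry wrapper follows. The name `retry`, its parameters, and the exponential backoff with jitter are my own assumptions for illustration, not something Cursor produced.

```python
import random
import time


def retry(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or attempts run out (hypothetical helper)."""
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:
            last_error = exc
            if attempt < attempts - 1:
                # Exponential backoff plus a little jitter between attempts
                sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.5))
    # Fail loudly instead of returning None to the caller
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error
```

A caller would pass a function that raises on a bad status, for example one that calls `resp.raise_for_status()` before returning the response.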
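II. Zero articles found
Fetching the articles failed with the message "共发现0篇文章,开始下载" (found 0 articles, starting download). Analyze the cause and improve the code.
Problem analysis:
- Your crawler most likely reports "共发现0篇文章" because the article links in the page source do not match your regular expression, or because a redirect or an anti-crawling measure changed the page content, so no links were captured.
Optimization suggestions:
- Inspect the page source: print the first few hundred characters of the page source and confirm the real format of the article links.
- Loosen the regex: some links may not be 202x-xx.html but 202x-xx/ or some other format.
- Check encoding and anti-crawling: some sites return different content depending on the User-Agent, or return garbled text.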
Debugging snippet (run this first and look at the output):

    import requests

    url = "http://mysql.taobao.org/monthly/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    resp = requests.get(url, headers=headers)
    print(resp.status_code)
    print(resp.apparent_encoding)
    print(resp.text[:1000])  # print the first 1000 characters
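Please run the code above and paste the output, so that I can adjust the regex and parsing logic precisely and make sure every article link is captured.
If you would rather have me give you more robust crawler code directly, tell me which strategy you prefer (for example a looser regex, or grabbing every link that ends in .html).
The result: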
D:\book_download\venv\Scripts\python.exe D:/book_download/book_download/book/ai_get_url.py
200
utf-8
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<!-- <title>æ°æ®åºå
æ ¸ææ¥</title> -->
<meta name="description" content="æ°æ®åºå
æ ¸ææ¥, æ¥çé¿éäº PolarDB æ°æ®åºå
æ ¸å¢éã">
<link rel="stylesheet" href="/monthly/css/typo.css">
<link rel="stylesheet" href="/monthly/css/animate.css">
<link rel="stylesheet" href="/monthly/css/main.css">
<link rel="canonical" href="http://mysql.taobao.org//monthly/">
<link rel="alternate" type="application/rss+xml" title="æ°æ®åºå
æ ¸ææ¥" href="http://mysql.taobao.org//monthly/feed.xml" />
<link rel="stylesheet" href="//cdn.staticfile.org/highlight.js/8.3/styles/tomorrow.min.css">
<script src="/monthly/js/highlight.min.js"></script>
<!-- <link rel="stylesheet" href="/monthly/themes/tomorrow.css">
<script src="/monthly/highlight/highlight.pack.js"> -->
<script>hl
Process finished with exit code 0
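Thanks for the output!
The page returns status code 200 with utf-8 encoding and the source looks normal, but what you pasted is only the first few lines and does not yet include the article links.
Still, this shows that the page was not redirected or blocked by anti-crawling measures.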
Debugging code:

    import requests
    from bs4 import BeautifulSoup

    url = "http://mysql.taobao.org/monthly/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    resp = requests.get(url, headers=headers)
    soup = BeautifulSoup(resp.text, "lxml")
    # Print the href of every <a> tag
    for a in soup.find_all("a"):
        print(a.get("href"))
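Please run the code above and paste the output, so that I can write a regex and crawler that capture every article link.
The debugging code returned the following: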
/monthly/
/monthly/2025/04
/monthly/2025/03
/monthly/2025/02
/monthly/2025/01
/monthly/2024/12
/monthly/2024/11
/monthly/2024/10
...
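The listing above suggests why the first version found nothing: the index links are paths such as /monthly/2025/04, not 202x-xx.html file names. A small sketch comparing the two patterns on a few of the printed hrefs (the variable names here are illustrative only):

```python
import re

hrefs = ["/monthly/", "/monthly/2025/04", "/monthly/2025/03", "/monthly/2024/12"]

old_pattern = re.compile(r"\d{4}-\d{2}\.html")            # pattern from the first attempt
new_pattern = re.compile(r"/monthly/(\d{4})/(\d{2})$")    # matches /monthly/YYYY/MM

old_hits = [h for h in hrefs if old_pattern.match(h)]
new_hits = [h for h in hrefs if new_pattern.match(h)]

print(old_hits)  # the old pattern matches none of the hrefs
print(new_hits)  # the new pattern keeps only the month directories
```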
 - import os
- import re
- import time
- import random
- import requests
- from bs4 import BeautifulSoup
- from concurrent.futures import ThreadPoolExecutor, as_completed
- BASE_URL = "http://mysql.taobao.org"
- MONTHLY_URL = BASE_URL + "/monthly/"
- HEADERS = {
- "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
- }
- SAVE_DIR = "taobao"
- MAX_RETRIES = 3
- THREADS = 8
- os.makedirs(SAVE_DIR, exist_ok=True)
- def get_with_retry(url, retries=MAX_RETRIES, **kwargs):
- for i in range(retries):
- try:
- resp = requests.get(url, headers=HEADERS, timeout=10, **kwargs)
- if resp.status_code == 200:
- return resp
- except Exception:
- if i == retries - 1:
- raise
- time.sleep(random.uniform(1, 3))
- return None
- def get_all_article_links():
- resp = get_with_retry(MONTHLY_URL)
- soup = BeautifulSoup(resp.text, "lxml")
- links = set()
- for a in soup.find_all("a", href=True):
- href = a["href"]
- # 匹配 /monthly/YYYY/MM 目录
- m = re.match(r"/monthly/(\d{4})/(\d{2})$", href)
- if m:
- links.add(BASE_URL + href)
- return sorted(links)
- def save_article(url):
- try:
- resp = get_with_retry(url)
- soup = BeautifulSoup(resp.text, "lxml")
- # 获取文章标题
- title = soup.title.text.strip() if soup.title else url.split("/")[-2] + "-" + url.split("/")[-1]
- # 获取正文内容
- content_div = soup.find("div", class_="content")
- if not content_div:
- content_div = soup.find("div", id="content")
- content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
- # 生成markdown内容
- md_content = f"# {title}\n\n原文链接: [{url}]({url})\n\n---\n\n{content}"
- # 文件名
- filename = os.path.join(SAVE_DIR, url.rstrip("/").split("/")[-2] + "-" + url.rstrip("/").split("/")[-1] + ".md")
- with open(filename, "w", encoding="utf-8") as f:
- f.write(md_content)
- print(f"已保存: {filename}")
- time.sleep(random.uniform(0.5, 1.5))
- except Exception as e:
- print(f"下载失败: {url},原因: {e}")
- def main():
- links = get_all_article_links()
- print(f"共发现{len(links)}篇文章,开始下载...")
- with ThreadPoolExecutor(max_workers=THREADS) as executor:
- futures = [executor.submit(save_article, url) for url in links]
- for future in as_completed(futures):
- pass
- print("全部下载完成。")
- if __name__ == "__main__":
- main()
III. Directory links were scraped instead of article content
There are two problems
1. The Chinese text in the scraped articles is garbled
05
MySQL · 5.7ä¼å · Metadata Lockåç³»ç»çä¼å
06
MySQL · 5.7ç¹æ§ · å¨çº¿Truncate undo log 表空é´
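2. You scraped the article links inside each directory, not the actual article content in the directory
Your feedback is very helpful. Below is a detailed analysis of the two problems, with solutions:
1. Garbled Chinese text
Cause:
- Garbled text usually means that requests did not detect the page encoding correctly, or that the page's actual encoding differs from the one used to decode resp.text.
- The site's pages declare <meta charset="utf-8">, but requests sometimes guesses wrong, so the encoding has to be set explicitly.
Fix:
- After getting the response, force resp.encoding = 'utf-8' and then read resp.text.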
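The garbling can be reproduced without any network access. As far as I know, requests falls back to ISO-8859-1 for text responses whose Content-Type header names no charset; that this is what happened here is an assumption, since the response headers were not recorded. A minimal sketch:

```python
title = "数据库内核月报"

raw = title.encode("utf-8")        # bytes as a UTF-8 server would send them
garbled = raw.decode("latin-1")    # what a wrong ISO-8859-1 decode produces

# Undo the wrong decode and decode again with the right codec
repaired = garbled.encode("latin-1").decode("utf-8")

print(repaired == title)
```

Setting `resp.encoding = 'utf-8'` before reading `resp.text` avoids the wrong decode in the first place, which is simpler than repairing the string afterwards.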
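2. Article links in the directory were scraped instead of the actual article content
Cause:
- What you are fetching now is each month's directory page (such as /monthly/2025/04), but the actual content is in the body of that directory page (usually <div class="content">), not in the other links under that page.
- You only need to scrape the body content of each directory page; there is no need to follow the sub-links.
- Scrape only the body content of each /monthly/YYYY/MM directory page.
- Force the encoding to fix the garbled Chinese.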
- import os
- import re
- import time
- import random
- import requests
- from bs4 import BeautifulSoup
- from concurrent.futures import ThreadPoolExecutor, as_completed
- BASE_URL = "http://mysql.taobao.org"
- MONTHLY_URL = BASE_URL + "/monthly/"
- HEADERS = {
- "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
- }
- SAVE_DIR = "taobao"
- MAX_RETRIES = 3
- THREADS = 8
- os.makedirs(SAVE_DIR, exist_ok=True)
- def get_with_retry(url, retries=MAX_RETRIES, **kwargs):
- for i in range(retries):
- try:
- resp = requests.get(url, headers=HEADERS, timeout=10, **kwargs)
- resp.encoding = 'utf-8' # 强制设定编码,防止中文乱码
- if resp.status_code == 200:
- return resp
- except Exception:
- if i == retries - 1:
- raise
- time.sleep(random.uniform(1, 3))
- return None
- def get_all_article_links():
- resp = get_with_retry(MONTHLY_URL)
- soup = BeautifulSoup(resp.text, "lxml")
- links = set()
- for a in soup.find_all("a", href=True):
- href = a["href"]
- # 匹配 /monthly/YYYY/MM 目录
- m = re.match(r"/monthly/(\d{4})/(\d{2})$", href)
- if m:
- links.add(BASE_URL + href)
- return sorted(links)
- def save_article(url):
- try:
- resp = get_with_retry(url)
- soup = BeautifulSoup(resp.text, "lxml")
- # 获取文章标题
- title = soup.title.text.strip() if soup.title else url.split("/")[-2] + "-" + url.split("/")[-1]
- # 获取正文内容
- content_div = soup.find("div", class_="content")
- if not content_div:
- content_div = soup.find("div", id="content")
- if content_div:
- content = content_div.get_text("\n", strip=True)
- else:
- content = soup.get_text("\n", strip=True)
- # 生成markdown内容
- md_content = f"# {title}\n\n原文链接: [{url}]({url})\n\n---\n\n{content}"
- # 文件名
- filename = os.path.join(SAVE_DIR, url.rstrip("/").split("/")[-2] + "-" + url.rstrip("/").split("/")[-1] + ".md")
- with open(filename, "w", encoding="utf-8") as f:
- f.write(md_content)
- print(f"已保存: {filename}")
- time.sleep(random.uniform(0.5, 1.5))
- except Exception as e:
- print(f"下载失败: {url},原因: {e}")
- def main():
- links = get_all_article_links()
- print(f"共发现{len(links)}篇文章,开始下载...")
- with ThreadPoolExecutor(max_workers=THREADS) as executor:
- futures = [executor.submit(save_article, url) for url in links]
- for future in as_completed(futures):
- pass
- print("全部下载完成。")
- if __name__ == "__main__":
- main()
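3. Breaking the requirement down further
The Chinese display problem is solved, but what gets scraped is still the article titles under each directory rather than the article content. So first try fetching the article content under @http://mysql.taobao.org/monthly/2024/12/ for me, still saved in markdown format under the taobao directory.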
    import os
    import requests
    from bs4 import BeautifulSoup

    url = "http://mysql.taobao.org/monthly/2024/12/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    os.makedirs(SAVE_DIR, exist_ok=True)

    resp = requests.get(url, headers=headers, timeout=10)
    resp.encoding = 'utf-8'
    soup = BeautifulSoup(resp.text, "lxml")

    # Page title
    title = soup.find("h1")
    md_content = f"# {title.text.strip() if title else '数据库内核月报'}\n\n原文链接: [{url}]({url})\n\n---\n\n"

    # Collect every entry (usually an h3 or h2 heading plus the content after it)
    for tag in soup.find_all(['h3', 'h2']):
        # Keep only entries whose heading starts with "# XX"
        if tag.text.strip().startswith("#"):
            md_content += f"## {tag.text.strip()}\n\n"
            # Gather everything under this heading until the next h3/h2
            for sibling in tag.find_next_siblings():
                if sibling.name in ['h3', 'h2']:
                    break
                # Keep only body paragraphs, lists, code and quotes
                if sibling.name in ['p', 'ul', 'ol', 'pre', 'blockquote']:
                    md_content += sibling.get_text("\n", strip=True) + "\n\n"

    filename = os.path.join(SAVE_DIR, "2024-12.md")
    with open(filename, "w", encoding="utf-8") as f:
        f.write(md_content)
    print(f"已保存: {filename}")
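There is still a problem: you did not open the sub-links in the directory. For example, the first article opens at @http://mysql.taobao.org/monthly/2024/12/01/ ; you should open each link in turn and fetch the article content.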
    import os
    import re
    import requests
    from bs4 import BeautifulSoup

    BASE_URL = "http://mysql.taobao.org"
    MONTH_URL = "/monthly/2024/12/"
    FULL_MONTH_URL = BASE_URL + MONTH_URL
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
    }
    SAVE_DIR = "taobao"
    os.makedirs(SAVE_DIR, exist_ok=True)

    def get_soup(url):
        resp = requests.get(url, headers=HEADERS, timeout=10)
        resp.encoding = 'utf-8'
        return BeautifulSoup(resp.text, "lxml")

    def get_article_links(month_url):
        soup = get_soup(month_url)
        links = []
        for a in soup.find_all("a", href=True):
            href = a["href"]
            # Match sub-article links such as /monthly/2024/12/01/
            if re.match(r"/monthly/2024/12/\d{2}/$", href):
                # Make the link absolute
                if not href.startswith("http"):
                    href = BASE_URL + href
                links.append(href)
        return sorted(set(links))

    def get_article_content(article_url):
        soup = get_soup(article_url)
        # Title
        h2 = soup.find("h2")
        title = h2.text.strip() if h2 else article_url.split("/")[-2]
        # Body
        content_div = soup.find("div", class_="content")
        if not content_div:
            content_div = soup.find("div", id="content")
        content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
        return f"## {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n\n---\n"

    def main():
        article_links = get_article_links(FULL_MONTH_URL)
        print(f"共发现{len(article_links)}篇文章")
        md_content = f"# 数据库内核月报 2024/12\n\n原文目录: [{FULL_MONTH_URL}]({FULL_MONTH_URL})\n\n---\n\n"
        for link in article_links:
            print(f"抓取: {link}")
            md_content += get_article_content(link)
        filename = os.path.join(SAVE_DIR, "2024-12.md")
        with open(filename, "w", encoding="utf-8") as f:
            f.write(md_content)
        print(f"已保存: {filename}")

    if __name__ == "__main__":
        main()
That's it, nicely done~ But you need to store each article in its own .md file, named after the article title. Keep going.
 - import os
- import re
- import requests
- from bs4 import BeautifulSoup
- BASE_URL = "http://mysql.taobao.org"
- MONTH_URL = "/monthly/2024/12/"
- FULL_MONTH_URL = BASE_URL + MONTH_URL
- HEADERS = {
- "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
- }
- SAVE_DIR = "taobao"
- os.makedirs(SAVE_DIR, exist_ok=True)
- def safe_filename(title):
- # Remove characters that are not allowed in Windows file names: \ / : * ? " < > |
- return re.sub(r'[\\/:*?"<>|]', '', title).strip()
- def get_soup(url):
- resp = requests.get(url, headers=HEADERS, timeout=10)
- resp.encoding = 'utf-8'
- return BeautifulSoup(resp.text, "lxml")
- def get_article_links(month_url):
- soup = get_soup(month_url)
- links = []
- for a in soup.find_all("a", href=True):
- href = a["href"]
- # 匹配 /monthly/2024/12/01/ 这种子文章链接
- if re.match(r"/monthly/2024/12/\d{2}/$", href):
- # 绝对化
- if not href.startswith("http"):
- href = BASE_URL + href
- links.append(href)
- return sorted(set(links))
- def save_article(article_url):
- soup = get_soup(article_url)
- # 标题
- h2 = soup.find("h2")
- title = h2.text.strip() if h2 else article_url.split("/")[-2]
- filename = safe_filename(title) + ".md"
- # 正文
- content_div = soup.find("div", class_="content")
- if not content_div:
- content_div = soup.find("div", id="content")
- content = content_div.get_text("\n", strip=True) if content_div else soup.get_text("\n", strip=True)
- md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
- filepath = os.path.join(SAVE_DIR, filename)
- with open(filepath, "w", encoding="utf-8") as f:
- f.write(md_content)
- print(f"已保存: {filepath}")
- def main():
- article_links = get_article_links(FULL_MONTH_URL)
- print(f"共发现{len(article_links)}篇文章")
- for link in article_links:
- save_article(link)
- print("全部下载完成。")
- if __name__ == "__main__":
- main()
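Since the file name now comes from the article title, it helps to see what the `safe_filename` helper above actually does: it removes the characters Windows forbids in file names and leaves everything else, including Chinese text and the "·" separators, untouched. A small sketch with two sample titles (the second is made up for illustration):

```python
import re


def safe_filename(title):
    # Remove characters that Windows does not allow in file names
    return re.sub(r'[\\/:*?"<>|]', '', title).strip()


print(safe_filename('MySQL · 5.7特性 · 在线Truncate undo log 表空间'))
print(safe_filename('a/b: "c"?'))
```

Note that two articles with identical titles would map to the same file name and the later one would overwrite the earlier one; adding the article number from the URL to the name would be one way around that, though the original code does not do so.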
IV. Formatting the content
The content is OK. Please also format the text and code inside it so that it looks tidier.
- import os
- import re
- import requests
- from bs4 import BeautifulSoup
- BASE_URL = "http://mysql.taobao.org"
- MONTH_URL = "/monthly/2024/12/"
- FULL_MONTH_URL = BASE_URL + MONTH_URL
- HEADERS = {
- "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
- }
- SAVE_DIR = "taobao"
- os.makedirs(SAVE_DIR, exist_ok=True)
- def safe_filename(title):
- # Remove characters that are not allowed in Windows file names: \ / : * ? " < > |
- return re.sub(r'[\\/:*?"<>|]', '', title).strip()
- def get_soup(url):
- resp = requests.get(url, headers=HEADERS, timeout=10)
- resp.encoding = 'utf-8'
- return BeautifulSoup(resp.text, "lxml")
- def get_article_links(month_url):
- soup = get_soup(month_url)
- links = []
- for a in soup.find_all("a", href=True):
- href = a["href"]
- # 匹配 /monthly/2024/12/01/ 这种子文章链接
- if re.match(r"/monthly/2024/12/\d{2}/$", href):
- # 绝对化
- if not href.startswith("http"):
- href = BASE_URL + href
- links.append(href)
- return sorted(set(links))
- def format_content(content_div):
- md_lines = []
- for elem in content_div.children:
- if elem.name == 'pre':
- # 代码块
- code = elem.get_text("\n", strip=True)
- md_lines.append(f'\n```\n{code}\n```\n')
- elif elem.name in ['ul', 'ol']:
- # 列表
- for li in elem.find_all('li'):
- md_lines.append(f'- {li.get_text(strip=True)}')
- md_lines.append('')
- elif elem.name == 'blockquote':
- # 引用
- quote = elem.get_text("\n", strip=True)
- md_lines.append(f'> {quote}\n')
- elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
- # 标题
- level = int(elem.name[1])
- md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
- elif elem.name == 'p' or elem.name is None:
- # 普通段落
- text = elem.get_text("\n", strip=True)
- if text:
- md_lines.append(f'{text}\n')
- return '\n'.join(md_lines)
- def save_article(article_url):
- soup = get_soup(article_url)
- # 标题
- h2 = soup.find("h2")
- title = h2.text.strip() if h2 else article_url.split("/")[-2]
- filename = safe_filename(title) + ".md"
- # 正文
- content_div = soup.find("div", class_="content")
- if not content_div:
- content_div = soup.find("div", id="content")
- if content_div:
- content = format_content(content_div)
- else:
- content = soup.get_text("\n", strip=True)
- md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
- filepath = os.path.join(SAVE_DIR, filename)
- with open(filepath, "w", encoding="utf-8") as f:
- f.write(md_content)
- print(f"已保存: {filepath}")
- def main():
- article_links = get_article_links(FULL_MONTH_URL)
- print(f"共发现{len(article_links)}篇文章")
- for link in article_links:
- save_article(link)
- print("全部下载完成。")
- if __name__ == "__main__":
- main()
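V. Merging the code to meet the initial requirement
Yes, that's it. Combining the ideas above, let's return to the original requirement: collect all article links from http://mysql.taobao.org/monthly/ and save each document's content in markdown format under a local taobao directory, with the other requirements unchanged.
- import os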
- import re
- import time
- import random
- import requests
- from bs4 import BeautifulSoup
- from concurrent.futures import ThreadPoolExecutor, as_completed
- BASE_URL = "http://mysql.taobao.org"
- MONTHLY_URL = BASE_URL + "/monthly/"
- HEADERS = {
- "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
- }
- SAVE_DIR = "taobao"
- THREADS = 8
- os.makedirs(SAVE_DIR, exist_ok=True)
- def safe_filename(title):
- return re.sub(r'[\\/:*?"<>|]', '', title).strip()
- def get_soup(url):
- for _ in range(3):
- try:
- resp = requests.get(url, headers=HEADERS, timeout=10)
- resp.encoding = 'utf-8'
- if resp.status_code == 200:
- return BeautifulSoup(resp.text, "lxml")
- except Exception:
- time.sleep(random.uniform(1, 2))
- return None
- def get_all_month_links():
- soup = get_soup(MONTHLY_URL)
- links = set()
- for a in soup.find_all("a", href=True):
- href = a["href"]
- # 匹配 /monthly/YYYY/MM 目录
- m = re.match(r"/monthly/(\d{4})/(\d{2})$", href)
- if m:
- links.add(BASE_URL + href + "/")
- return sorted(links)
- def get_article_links(month_url):
- soup = get_soup(month_url)
- links = []
- if not soup:
- return links
- for a in soup.find_all("a", href=True):
- href = a["href"]
- # 匹配 /monthly/YYYY/MM/XX/ 这种子文章链接
- m = re.match(r"/monthly/\d{4}/\d{2}/\d{2}/$", href)
- if m:
- if not href.startswith("http"):
- href = BASE_URL + href
- links.append(href)
- return sorted(set(links))
- def format_content(content_div):
- md_lines = []
- for elem in content_div.children:
- if elem.name == 'pre':
- code = elem.get_text("\n", strip=True)
- md_lines.append(f'\n```\n{code}\n```\n')
- elif elem.name in ['ul', 'ol']:
- for li in elem.find_all('li'):
- md_lines.append(f'- {li.get_text(strip=True)}')
- md_lines.append('')
- elif elem.name == 'blockquote':
- quote = elem.get_text("\n", strip=True)
- md_lines.append(f'> {quote}\n')
- elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
- level = int(elem.name[1])
- md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
- elif elem.name == 'p' or elem.name is None:
- text = elem.get_text("\n", strip=True)
- if text:
- md_lines.append(f'{text}\n')
- return '\n'.join(md_lines)
- def save_article(article_url):
- soup = get_soup(article_url)
- if not soup:
- print(f"获取失败: {article_url}")
- return
- h2 = soup.find("h2")
- title = h2.text.strip() if h2 else article_url.split("/")[-2]
- filename = safe_filename(title) + ".md"
- content_div = soup.find("div", class_="content")
- if not content_div:
- content_div = soup.find("div", id="content")
- if content_div:
- content = format_content(content_div)
- else:
- content = soup.get_text("\n", strip=True)
- md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
- filepath = os.path.join(SAVE_DIR, filename)
- with open(filepath, "w", encoding="utf-8") as f:
- f.write(md_content)
- print(f"已保存: {filepath}")
- time.sleep(random.uniform(0.5, 1.2))
- def main():
- month_links = get_all_month_links()
- print(f"共发现{len(month_links)}个月份")
- all_article_links = []
- for month_url in month_links:
- article_links = get_article_links(month_url)
- print(f"{month_url} 发现{len(article_links)}篇文章")
- all_article_links.extend(article_links)
- print(f"总共发现{len(all_article_links)}篇文章,开始下载...")
- with ThreadPoolExecutor(max_workers=THREADS) as executor:
- futures = [executor.submit(save_article, url) for url in all_article_links]
- for future in as_completed(futures):
- pass
- print("全部下载完成。")
- if __name__ == "__main__":
- main()
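One detail in the merged code above: the loop `for future in as_completed(futures): pass` never calls `future.result()`, and this version of `save_article` has no try/except, so an exception raised in a worker thread is dropped silently. A minimal sketch of collecting failures instead; `fake_save` is a stand-in I made up so the example runs without network access:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fake_save(url):
    # Hypothetical stand-in for save_article; fails for one URL on purpose
    if url.endswith("/03/"):
        raise ValueError("simulated failure")
    return url


urls = [f"http://mysql.taobao.org/monthly/2024/12/{i:02d}/" for i in range(1, 5)]
saved, failed = [], []

with ThreadPoolExecutor(max_workers=4) as executor:
    # Map each future back to its URL so failures can be reported by URL
    futures = {executor.submit(fake_save, u): u for u in urls}
    for future in as_completed(futures):
        url = futures[future]
        try:
            saved.append(future.result())  # re-raises any exception from the worker
        except Exception as exc:
            failed.append((url, str(exc)))

print(len(saved), "saved;", len(failed), "failed")
```

The failed list could then be retried or written to a log once the pool finishes.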
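VI. Finer-grained directories and image scraping
Nicely done~ The requirement is basically met. A few improvements:
1. Create subdirectories by the year and month in the link, and put each month's articles in the matching one
2. Only scrape articles from 2024 and 2025
3. If an article contains tables, images and so on, scrape those as well
- import os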
- import re
- import time
- import random
- import requests
- from bs4 import BeautifulSoup
- from concurrent.futures import ThreadPoolExecutor, as_completed
- from urllib.parse import urljoin, urlparse
- BASE_URL = "http://mysql.taobao.org"
- MONTHLY_URL = BASE_URL + "/monthly/"
- HEADERS = {
- "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
- }
- SAVE_DIR = "taobao"
- THREADS = 8
- os.makedirs(SAVE_DIR, exist_ok=True)
- def safe_filename(title):
- return re.sub(r'[\\/:*?"<>|]', '', title).strip()
- def get_soup(url):
- for _ in range(3):
- try:
- resp = requests.get(url, headers=HEADERS, timeout=10)
- resp.encoding = 'utf-8'
- if resp.status_code == 200:
- return BeautifulSoup(resp.text, "lxml")
- except Exception:
- time.sleep(random.uniform(1, 2))
- return None
- def get_all_month_links():
- soup = get_soup(MONTHLY_URL)
- links = set()
- for a in soup.find_all("a", href=True):
- href = a["href"]
- # 只抓2024和2025年
- m = re.match(r"/monthly/(2024|2025)/(\d{2})$", href)
- if m:
- links.add(BASE_URL + href + "/")
- return sorted(links)
- def get_article_links(month_url):
- soup = get_soup(month_url)
- links = []
- if not soup:
- return links
- for a in soup.find_all("a", href=True):
- href = a["href"]
- # 匹配 /monthly/2024/12/01/ 这种子文章链接
- m = re.match(r"/monthly/(2024|2025)/\d{2}/\d{2}/$", href)
- if m:
- if not href.startswith("http"):
- href = BASE_URL + href
- links.append(href)
- return sorted(set(links))
- def download_image(img_url, save_path):
- try:
- resp = requests.get(img_url, headers=HEADERS, timeout=10)
- if resp.status_code == 200:
- with open(save_path, 'wb') as f:
- f.write(resp.content)
- return True
- except Exception:
- pass
- return False
- def format_table(table):
- rows = table.find_all('tr')
- md = []
- for i, row in enumerate(rows):
- cols = row.find_all(['td', 'th'])
- line = '| ' + ' | '.join(col.get_text(strip=True) for col in cols) + ' |'
- md.append(line)
- if i == 0:
- md.append('|' + '---|' * len(cols))
- return '\n'.join(md) + '\n'
- def format_content(content_div, article_url, save_dir):
- md_lines = []
- for elem in content_div.children:
- if elem.name == 'pre':
- code = elem.get_text("\n", strip=True)
- md_lines.append(f'\n```\n{code}\n```\n')
- elif elem.name in ['ul', 'ol']:
- for li in elem.find_all('li'):
- md_lines.append(f'- {li.get_text(strip=True)}')
- md_lines.append('')
- elif elem.name == 'blockquote':
- quote = elem.get_text("\n", strip=True)
- md_lines.append(f'> {quote}\n')
- elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
- level = int(elem.name[1])
- md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
- elif elem.name == 'p' or elem.name is None:
- text = elem.get_text("\n", strip=True)
- if text:
- md_lines.append(f'{text}\n')
- elif elem.name == 'table':
- md_lines.append(format_table(elem))
- elif elem.name == 'img':
- img_src = elem.get('src')
- if img_src:
- img_url = urljoin(article_url, img_src)
- img_name = os.path.basename(urlparse(img_url).path)
- img_save_path = os.path.join(save_dir, img_name)
- if download_image(img_url, img_save_path):
- md_lines.append(f'![]({img_name})\n')  # Markdown image syntax; empty alt text is assumed
- # 处理段落内图片
- for img in content_div.find_all('img'):
- img_src = img.get('src')
- if img_src:
- img_url = urljoin(article_url, img_src)
- img_name = os.path.basename(urlparse(img_url).path)
- img_save_path = os.path.join(save_dir, img_name)
- if download_image(img_url, img_save_path):
- # 替换原img为markdown图片语法
- img.replace_with(f'![]({img_name})')  # Markdown image syntax; empty alt text is assumed
- return '\n'.join(md_lines)
- def save_article(article_url, year, month):
- soup = get_soup(article_url)
- if not soup:
- print(f"获取失败: {article_url}")
- return
- h2 = soup.find("h2")
- title = h2.text.strip() if h2 else article_url.split("/")[-2]
- filename = safe_filename(title) + ".md"
- # 按年份和月份创建子目录
- subdir = os.path.join(SAVE_DIR, year, month)
- os.makedirs(subdir, exist_ok=True)
- content_div = soup.find("div", class_="content")
- if not content_div:
- content_div = soup.find("div", id="content")
- if content_div:
- content = format_content(content_div, article_url, subdir)
- else:
- content = soup.get_text("\n", strip=True)
- md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
- filepath = os.path.join(subdir, filename)
- with open(filepath, "w", encoding="utf-8") as f:
- f.write(md_content)
- print(f"已保存: {filepath}")
- time.sleep(random.uniform(0.5, 1.2))
- def main():
- month_links = get_all_month_links()
- print(f"共发现{len(month_links)}个月份")
- all_tasks = []
- for month_url in month_links:
- m = re.match(r".*/monthly/(2024|2025)/(\d{2})/", month_url)
- if not m:
- continue
- year, month = m.group(1), m.group(2)
- article_links = get_article_links(month_url)
- print(f"{month_url} 发现{len(article_links)}篇文章")
- for link in article_links:
- all_tasks.append((link, year, month))
- print(f"总共发现{len(all_tasks)}篇文章,开始下载...")
- with ThreadPoolExecutor(max_workers=THREADS) as executor:
- futures = [executor.submit(save_article, url, year, month) for url, year, month in all_tasks]
- for future in as_completed(futures):
- pass
- print("全部下载完成。")
- if __name__ == "__main__":
- main()
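The table handling in the code above boils down to writing one pipe-separated line per row and a separator line after the first row. A self-contained sketch of that idea on plain lists follows; the function name `rows_to_markdown` and the pipe escaping are my additions and are not part of the generated code:

```python
def rows_to_markdown(rows):
    """Render rows (each a list of cell strings) as a Markdown table."""
    lines = []
    for i, cells in enumerate(rows):
        # Escape pipes so cell text cannot break the column layout
        escaped = [c.replace("|", "\\|") for c in cells]
        lines.append("| " + " | ".join(escaped) + " |")
        if i == 0:
            # Separator line after the header row
            lines.append("|" + "---|" * len(cells))
    return "\n".join(lines) + "\n"


print(rows_to_markdown([["Version", "Feature"], ["5.7", "a|b"]]))
```

The first row is always treated as the header, which matches how `format_table` behaves even when the HTML table has no `<th>` cells.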
Hmm, do not save the images separately; embed them back into the .md document so that they are easy for users to look up.
- import os
- import re
- import time
- import random
- import requests
- from bs4 import BeautifulSoup
- from concurrent.futures import ThreadPoolExecutor, as_completed
- from urllib.parse import urljoin
- BASE_URL = "http://mysql.taobao.org"
- MONTHLY_URL = BASE_URL + "/monthly/"
- HEADERS = {
- "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
- }
- SAVE_DIR = "taobao"
- THREADS = 8
- os.makedirs(SAVE_DIR, exist_ok=True)
- def safe_filename(title):
- return re.sub(r'[\\/:*?"<>|]', '', title).strip()
- def get_soup(url):
- for _ in range(3):
- try:
- resp = requests.get(url, headers=HEADERS, timeout=10)
- resp.encoding = 'utf-8'
- if resp.status_code == 200:
- return BeautifulSoup(resp.text, "lxml")
- except Exception:
- time.sleep(random.uniform(1, 2))
- return None
- def get_all_month_links():
- soup = get_soup(MONTHLY_URL)
- links = set()
- for a in soup.find_all("a", href=True):
- href = a["href"]
- m = re.match(r"/monthly/(2024|2025)/(\d{2})$", href)
- if m:
- links.add(BASE_URL + href + "/")
- return sorted(links)
- def get_article_links(month_url):
- soup = get_soup(month_url)
- links = []
- if not soup:
- return links
- for a in soup.find_all("a", href=True):
- href = a["href"]
- m = re.match(r"/monthly/(2024|2025)/\d{2}/\d{2}/$", href)
- if m:
- if not href.startswith("http"):
- href = BASE_URL + href
- links.append(href)
- return sorted(set(links))
- def format_table(table):
- rows = table.find_all('tr')
- md = []
- for i, row in enumerate(rows):
- cols = row.find_all(['td', 'th'])
- line = '| ' + ' | '.join(col.get_text(strip=True) for col in cols) + ' |'
- md.append(line)
- if i == 0:
- md.append('|' + '---|' * len(cols))
- return '\n'.join(md) + '\n'
- def format_content(content_div, article_url):
- md_lines = []
- for elem in content_div.children:
- if elem.name == 'pre':
- code = elem.get_text("\n", strip=True)
- md_lines.append(f'\n```\n{code}\n```\n')
- elif elem.name in ['ul', 'ol']:
- for li in elem.find_all('li'):
- md_lines.append(f'- {li.get_text(strip=True)}')
- md_lines.append('')
- elif elem.name == 'blockquote':
- quote = elem.get_text("\n", strip=True)
- md_lines.append(f'> {quote}\n')
- elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
- level = int(elem.name[1])
- md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
- elif elem.name == 'p' or elem.name is None:
- text = elem.get_text("\n", strip=True)
- if text:
- md_lines.append(f'{text}\n')
- elif elem.name == 'table':
- md_lines.append(format_table(elem))
- elif elem.name == 'img':
- img_src = elem.get('src')
- if img_src:
- img_url = urljoin(article_url, img_src)
- md_lines.append(f'![]({img_url})\n')  # Markdown image syntax; empty alt text is assumed
- # 处理段落内图片
- for img in content_div.find_all('img'):
- img_src = img.get('src')
- if img_src:
- img_url = urljoin(article_url, img_src)
- img.replace_with(f'![]({img_url})')  # Markdown image syntax; empty alt text is assumed
- return '\n'.join(md_lines)
- def save_article(article_url, year, month):
- soup = get_soup(article_url)
- if not soup:
- print(f"获取失败: {article_url}")
- return
- h2 = soup.find("h2")
- title = h2.text.strip() if h2 else article_url.split("/")[-2]
- filename = safe_filename(title) + ".md"
- subdir = os.path.join(SAVE_DIR, year, month)
- os.makedirs(subdir, exist_ok=True)
- content_div = soup.find("div", class_="content")
- if not content_div:
- content_div = soup.find("div", id="content")
- if content_div:
- content = format_content(content_div, article_url)
- else:
- content = soup.get_text("\n", strip=True)
- md_content = f"# {title}\n\n原文链接: [{article_url}]({article_url})\n\n{content}\n"
- filepath = os.path.join(subdir, filename)
- with open(filepath, "w", encoding="utf-8") as f:
- f.write(md_content)
- print(f"已保存: {filepath}")
- time.sleep(random.uniform(0.5, 1.2))
- def main():
- month_links = get_all_month_links()
- print(f"共发现{len(month_links)}个月份")
- all_tasks = []
- for month_url in month_links:
- m = re.match(r".*/monthly/(2024|2025)/(\d{2})/", month_url)
- if not m:
- continue
- year, month = m.group(1), m.group(2)
- article_links = get_article_links(month_url)
- print(f"{month_url} 发现{len(article_links)}篇文章")
- for link in article_links:
- all_tasks.append((link, year, month))
- print(f"总共发现{len(all_tasks)}篇文章,开始下载...")
- with ThreadPoolExecutor(max_workers=THREADS) as executor:
- futures = [executor.submit(save_article, url, year, month) for url, year, month in all_tasks]
- for future in as_completed(futures):
- pass
- print("全部下载完成。")
- if __name__ == "__main__":
- main()
The images did not display successfully.
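One likely reason, judging from the code above: the `img` branch only fires for direct children of the content div, while images nested inside a `<p>` are dropped by `get_text`, and the final `replace_with` loop runs after `md_lines` has already been built. A standard-library sketch of emitting text and images in document order with absolute URLs; the class name, function name, and sample image path are hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class InlineImageMarkdown(HTMLParser):
    """Collect text and <img> tags in document order as Markdown pieces."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attr_map = dict(attrs)
            src = attr_map.get("src")
            if src:
                alt = attr_map.get("alt") or ""
                # Resolve relative paths against the article URL
                self.parts.append(f"![{alt}]({urljoin(self.base_url, src)})")

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


def paragraph_to_markdown(html, base_url):
    parser = InlineImageMarkdown(base_url)
    parser.feed(html)
    parser.close()
    return "\n\n".join(parser.parts)


sample = '<p>Figure 1 shows the layout.<img src="/monthly/pic/demo/1.png" alt="layout"></p>'
print(paragraph_to_markdown(sample, "http://mysql.taobao.org/monthly/2024/12/01/"))
```

The same ordering idea carries over to BeautifulSoup: handle images while walking each paragraph rather than in a separate pass at the end.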


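Building on this optimization idea of yours, optimize the earlier code once more so that the scraped images display directly in the .md file.
- import os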
- import re
- import time
- import random
- import requests
- from bs4 import BeautifulSoup
- from concurrent.futures import ThreadPoolExecutor, as_completed
- from urllib.parse import urljoin
- BASE_URL = "http://mysql.taobao.org"
- MONTHLY_URL = BASE_URL + "/monthly/"
- HEADERS = {
- "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36"
- }
- SAVE_DIR = "taobao"
- THREADS = 8
- os.makedirs(SAVE_DIR, exist_ok=True)
- def safe_filename(title):
-     # Remove characters that are not allowed in file names (\ / : * ? " < > |)
-     return re.sub(r'[\\/:*?"<>|]', '', title).strip()
- def get_soup(url):
-     for _ in range(3):
-         try:
-             resp = requests.get(url, headers=HEADERS, timeout=10)
-             resp.encoding = 'utf-8'
-             if resp.status_code == 200:
-                 return BeautifulSoup(resp.text, "lxml")
-         except Exception:
-             time.sleep(random.uniform(1, 2))
-     return None
- def get_all_month_links():
-     soup = get_soup(MONTHLY_URL)
-     links = set()
-     if not soup:
-         # The index page could not be fetched, so there are no months to return
-         return []
-     for a in soup.find_all("a", href=True):
-         href = a["href"]
-         # Only scrape 2024 and 2025
-         m = re.match(r"/monthly/(2024|2025)/(\d{2})$", href)
-         if m:
-             links.add(BASE_URL + href + "/")
-     return sorted(links)
- def get_article_links(month_url):
-     soup = get_soup(month_url)
-     links = []
-     if not soup:
-         return links
-     for a in soup.find_all("a", href=True):
-         href = a["href"]
-         # Match article links of the form /monthly/2024/12/01/
-         m = re.match(r"/monthly/(2024|2025)/\d{2}/\d{2}/$", href)
-         if m:
-             if not href.startswith("http"):
-                 href = BASE_URL + href
-             links.append(href)
-     return sorted(set(links))
- def format_table(table):
-     rows = table.find_all('tr')
-     md = []
-     for i, row in enumerate(rows):
-         cols = row.find_all(['td', 'th'])
-         line = '| ' + ' | '.join(col.get_text(strip=True) for col in cols) + ' |'
-         md.append(line)
-         if i == 0:
-             md.append('|' + '---|' * len(cols))
-     return '\n'.join(md) + '\n'
- def format_content(content_div, article_url):
-     md_lines = []
-     for elem in content_div.children:
-         if elem.name == 'pre':
-             code = elem.get_text("\n", strip=True)
-             md_lines.append(f'\n```\n{code}\n```\n')
-         elif elem.name in ['ul', 'ol']:
-             for li in elem.find_all('li', recursive=False):
-                 # Handle images inside list items
-                 imgs = li.find_all('img')
-                 for img in imgs:
-                     img_src = img.get('src')
-                     if img_src:
-                         img_url = urljoin(article_url, img_src)
-                         md_lines.append(f'![]({img_url})')
-                 text = li.get_text(" ", strip=True)
-                 if text:
-                     md_lines.append(f'- {text}')
-             md_lines.append('')
-         elif elem.name == 'blockquote':
-             quote = elem.get_text("\n", strip=True)
-             md_lines.append(f'> {quote}\n')
-         elif elem.name in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
-             level = int(elem.name[1])
-             md_lines.append(f'\n{"#"*level} {elem.get_text(strip=True)}\n')
-         elif elem.name == 'p' or elem.name is None:
-             # Handle images inside paragraphs
-             if hasattr(elem, 'find_all'):
-                 imgs = elem.find_all('img')
-                 for img in imgs:
-                     img_src = img.get('src')
-                     if img_src:
-                         img_url = urljoin(article_url, img_src)
-                         md_lines.append(f'![]({img_url})')
-             text = elem.get_text("\n", strip=True)
-             if text:
-                 md_lines.append(f'{text}\n')
-         elif elem.name == 'table':
-             md_lines.append(format_table(elem))
-     return '\n'.join(md_lines)
- def save_article(article_url, year, month):
-     soup = get_soup(article_url)
-     if not soup:
-         print(f"Fetch failed: {article_url}")
-         return
-     h2 = soup.find("h2")
-     title = h2.text.strip() if h2 else article_url.split("/")[-2]
-     filename = safe_filename(title) + ".md"
-     # Create subdirectories by year and month
-     subdir = os.path.join(SAVE_DIR, year, month)
-     os.makedirs(subdir, exist_ok=True)
-     content_div = soup.find("div", class_="content")
-     if not content_div:
-         content_div = soup.find("div", id="content")
-     if content_div:
-         content = format_content(content_div, article_url)
-     else:
-         content = soup.get_text("\n", strip=True)
-     md_content = f"# {title}\n\nOriginal link: [{article_url}]({article_url})\n\n{content}\n"
-     filepath = os.path.join(subdir, filename)
-     with open(filepath, "w", encoding="utf-8") as f:
-         f.write(md_content)
-     print(f"Saved: {filepath}")
-     time.sleep(random.uniform(0.5, 1.2))
- def main():
-     month_links = get_all_month_links()
-     print(f"Found {len(month_links)} months")
-     all_tasks = []
-     for month_url in month_links:
-         m = re.match(r".*/monthly/(2024|2025)/(\d{2})/", month_url)
-         if not m:
-             continue
-         year, month = m.group(1), m.group(2)
-         article_links = get_article_links(month_url)
-         print(f"{month_url}: found {len(article_links)} articles")
-         for link in article_links:
-             all_tasks.append((link, year, month))
-     print(f"Found {len(all_tasks)} articles in total, starting download...")
-     with ThreadPoolExecutor(max_workers=THREADS) as executor:
-         futures = [executor.submit(save_article, url, year, month) for url, year, month in all_tasks]
-         for future in as_completed(futures):
-             try:
-                 # Surface exceptions raised inside worker threads
-                 future.result()
-             except Exception as e:
-                 print(f"Download task failed: {e}")
-     print("All downloads complete.")
- if __name__ == "__main__":
-     main()
Finally, at long last, the requirement is met!
VII. Execution results
Scraping succeeded.
Resulting articles
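VIII. Summary of the best prompt
Looking back over our conversation, what prompt should I have given you so that you would generate this target code on the first attempt?
Example of the best prompt
You are a Python web-scraping expert. Please write an efficient scraper in Python that meets the following requirements:
- Target site: scrape all monthly-report articles from 2024 and 2025 under http://mysql.taobao.org/monthly/.
- Directory structure: automatically detect every month and article link under 2024 and 2025, and create local subdirectories by "year/month" (e.g. taobao/2024/12/).
- Saving content: save each article as its own .md file in Markdown format, named after the article title, in the corresponding subdirectory.
- Content formatting: the body must be cleanly laid out, with support for common formats such as code blocks, tables (converted to Markdown tables), lists and blockquotes.
- Image handling: embed images in the Markdown file by their original network URL (that is, ![](image URL)) without downloading them locally, so that they display directly in a Markdown preview.
- Efficiency and anti-scraping: support multithreaded downloading, and set a User-Agent, random delays, automatic retries and similar measures to avoid being blocked.
- Robustness: when network errors, page-structure changes or similar problems occur, skip the item automatically and continue.
Please give complete, runnable Python code directly.
Benefits of writing it this way
- It makes the target scope explicit (2024/2025, all monthly reports, all articles).
You can add or remove details to suit your actual needs, but as long as the points above are covered, you can usually get the high-quality scraper code you want in one go!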
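IX. Knowledge points
What was your overall approach to generating the final code for this requirement? Please list it step by step. What are the key points and the difficult points in this code, and which knowledge areas does it involve? Please tie the analysis to the code; I'm not quite clear which parts of the code your steps correspond to.
1. Step breakdown and corresponding code
2. Analysis of key and difficult code
Key point 1: high-quality content parsing and formatting
- The format_content function converts the HTML body into cleanly laid-out Markdown and handles paragraphs, headings, code blocks, lists, blockquotes, tables and images.
- The format_table function converts HTML tables into Markdown tables.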
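To make the table conversion easier to follow, here is a minimal sketch of the same idea on plain lists of cell strings rather than BeautifulSoup tags. The name rows_to_markdown and the pipe-escaping step are assumptions added for illustration; they are not part of the generated code.

```python
def rows_to_markdown(rows):
    """Render a list of rows (each a list of cell strings) as a Markdown table.

    The first row is treated as the header, as format_table does.
    """
    if not rows:
        return ""
    lines = []
    for i, row in enumerate(rows):
        # Escape literal pipes so they do not split a cell (an added safeguard).
        cells = [str(cell).strip().replace("|", "\\|") for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
        if i == 0:
            # Separator line that marks the first row as the header.
            lines.append("|" + "---|" * len(cells))
    return "\n".join(lines) + "\n"


# Made-up sample data, only to show the output shape.
print(rows_to_markdown([["Engine", "Locking"], ["InnoDB", "row"], ["MyISAM", "table"]]))
```

Keeping the row-to-text step separate from the HTML parsing like this would also make it easy to check the table output without touching the network.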
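Difficult point 1: image URL handling
- urljoin(article_url, img_src) resolves each image src to a complete URL, which is then inserted as ![](image URL) so that the image displays directly in the .md file.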
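A small sketch of how urljoin behaves for the kinds of src values an article might contain. The helper image_markdown and the image paths are hypothetical; only urljoin itself and the article URL pattern come from the code above.

```python
from urllib.parse import urljoin

# Article URL in the /monthly/2024/12/01/ form matched by the scraper.
article_url = "http://mysql.taobao.org/monthly/2024/12/01/"


def image_markdown(page_url, img_src, alt=""):
    """Return a Markdown image tag with img_src resolved against page_url."""
    return f"![{alt}]({urljoin(page_url, img_src)})"


# Root-relative, page-relative and already-absolute sources (made-up paths).
for src in ["/monthly/pic/a.png", "pic/b.png", "https://example.com/c.png"]:
    print(image_markdown(article_url, src))
```

Because urljoin leaves absolute URLs untouched and resolves relative ones against the page, a single call covers all three cases.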
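Difficult point 2: multithreading and exception handling
- ThreadPoolExecutor fetches pages concurrently and get_soup retries automatically (up to 3 attempts), which keeps large batch scrapes robust.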
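The combination of retries and a thread pool can be sketched without any network access. Everything here (with_retry, run_all, fake_fetch and the sample URLs) is a hypothetical stand-in for illustration, assuming that a failed task should be recorded rather than stop the batch.

```python
import random
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def with_retry(func, arg, retries=3, base_delay=0.01):
    """Call func(arg), retrying on any exception up to `retries` attempts."""
    for attempt in range(1, retries + 1):
        try:
            return func(arg)
        except Exception:
            if attempt == retries:
                raise  # out of attempts: let the caller decide what to do
            # Short randomized pause before the next attempt.
            time.sleep(base_delay * random.uniform(1, 2))


def run_all(func, items, workers=4):
    """Run func over items concurrently; return (results, failures) dicts."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(with_retry, func, item): item for item in items}
        for future in as_completed(futures):
            item = futures[future]
            try:
                results[item] = future.result()
            except Exception as exc:
                # A failed task is recorded instead of stopping the others.
                failures[item] = exc
    return results, failures


def fake_fetch(url):
    """Stand-in for a network call: fails for URLs containing 'bad'."""
    if "bad" in url:
        raise ConnectionError(f"cannot reach {url}")
    return len(url)


ok, failed = run_all(fake_fetch, ["http://a/1", "http://a/bad", "http://a/22"])
print(ok, list(failed))
```

Calling future.result() is what makes worker exceptions visible; iterating over as_completed without it silently discards them.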
Difficult point 3: valid file names
- The safe_filename function strips illegal characters from the title so that saving the file does not fail.
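As a sketch of how the file-name cleanup could be hardened, the variant below keeps the same regular expression but adds a fallback name and a length cap. The fallback and max_len parameters and the sample titles are assumptions for illustration, not part of the generated code.

```python
import re


def safe_filename(title, fallback="untitled", max_len=100):
    """Strip characters that common file systems reject from a title."""
    cleaned = re.sub(r'[\\/:*?"<>|]', "", title).strip()
    # Collapse whitespace runs left behind by the removal (added for illustration).
    cleaned = re.sub(r"\s+", " ", cleaned)
    # Fall back to a placeholder if nothing usable remains.
    return cleaned[:max_len] or fallback


# Made-up titles, only to exercise the function.
print(safe_filename('PolarDB: what is "HTAP"?'))
print(safe_filename("???"))
```

Without a fallback, a title made up entirely of stripped characters would produce a file literally named ".md".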
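3. Knowledge areas and where they appear in the code
- requests/BeautifulSoup: all functions related to page requests and HTML parsing
- Markdown syntax: format_content, format_table, and so on
4. Summary
- You can go through each function and code block and see what role it plays in the overall flow.
- The key and difficult points all lie in content parsing, image handling, multithreading and robustness, and each is explicitly implemented in the code.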