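The Cause

Those damned novel promos on Douyin: every time, the clip cuts off right at the climax, and then it's nothing but ads and app-download prompts.

Would a programmer put up with that? Let's get to work.

The Effect

  1. Find a book source
  2. Scrape the novel text
  3. Convert the text to speech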

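First, find a book source

笔趣阁 (Biquge): very complete, but it sits behind "DDoS protection by Cloudflare".

铅笔小说 (23qb.net): not many access restrictions, which suits a half-baked amateur like me.

Start scraping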

  1. First, check that the URL can be fetched correctly
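import urllib.request as ur

url = "https://www.23qb.net/book/232488/90659053.html"

req = ur.Request(url)
# Send a browser-like User-Agent with the request
req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.116 Safari/537.36")

response = ur.urlopen(req)
# read() returns bytes; decode as GBK (as the complete script does) so the str searches in later steps work
html = response.read().decode("gbk")
print(html)  # prints the page source correctly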
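Decoding is the step most likely to bite here, so below is a minimal sketch of a more forgiving decode helper. The name `decode_html` and the fallback order are my own assumptions, not part of the original script; GBK comes first only because the complete script decodes this site as GBK. Note that UTF-8 bytes can sometimes decode as GBK without raising an error, so the order matters.

```python
def decode_html(raw, encodings=("gbk", "utf-8")):
    """Hypothetical helper: try each encoding in turn, then fall back to a lossy decode."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: decode with the first encoding and replace undecodable bytes
    return raw.decode(encodings[0], errors="replace")


if __name__ == "__main__":
    # Offline demo with bytes we encode ourselves, no network needed
    print(decode_html("铅笔小说".encode("gbk")))
```

In the scraper this would be called as `decode_html(response.read())` in place of a bare `.decode("gbk")`.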
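  2. Parse the chapter text and strip out everything else

Inspecting the page shows that the content is made up of a series of <p> tags.

We can locate where those paragraphs begin and end: they always start right after a call to the toolbarm() function, and they always end with an ad line for 铅笔小说 23qb.net.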

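content = []
# The chapter text sits between the toolbarm(); call and the site's ad line
start = html.find("toolbarm();")
end = html.find("铅笔小说", start)

a = html.find("<p>", start, end)
# Stop when no more <p> is found (find returns -1) or once we pass the ad line
while a != -1 and a < end:
    b = html.find("</p>", a)
    if b != -1 and b < end:
        content.append(html[a + 3: b])  # skip the 3-character "<p>"
    else:
        break
    a = html.find("<p>", b + 4)  # resume after the 4-character "</p>"
print(content)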
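The same slicing can also be done with a regular expression. This is only a sketch under the assumptions described above (text starts after `toolbarm();` and ends at the 铅笔小说 ad line); the function name `extract_paragraphs` and the sample HTML are made up for illustration and are not the real page.

```python
import re


def extract_paragraphs(html, start_marker="toolbarm();", end_marker="铅笔小说"):
    """Return the text of every complete <p>...</p> between the two markers."""
    start = html.find(start_marker)
    if start == -1:
        return []
    end = html.find(end_marker, start)
    if end == -1:
        end = len(html)
    body = html[start:end]
    # re.S lets "." match newlines, in case a paragraph spans several lines
    return [p.strip() for p in re.findall(r"<p>(.*?)</p>", body, flags=re.S)]


if __name__ == "__main__":
    # Hypothetical page fragment, shaped like the structure described above
    sample = ("<script>toolbarm();</script>"
              "<p>First line.</p><p>Second line.</p>"
              "<p>铅笔小说 23qb.net</p>")
    print(extract_paragraphs(sample))
```

The ad paragraph is dropped because the slice ends before its closing `</p>`, so the pattern never matches it.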
  3. Each pass only saves one chapter, so look for a link to the next chapter
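# The page contains a url_next field, so just pull it out
a = html.find("url_next:")
b = html.find(",", a, a + 128)
# a + 10 skips the 9-character "url_next:" plus one more character (presumably the opening quote);
# b - 1 drops the character just before the comma (presumably the closing quote)
print(html[a + 10: b - 1])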
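Fixed offsets break quietly if the page layout shifts by a character, so here is a hedged regex alternative. The name `find_next_url` is hypothetical, and the sample string is only my guess at what the field looks like (a quoted path after `url_next:`), not a copy of the real page.

```python
import re


def find_next_url(html):
    """Return the quoted value after url_next:, or None if the field is missing."""
    match = re.search(r"url_next\s*:\s*['\"]([^'\"]+)['\"]", html)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical snippet of the page's inline script
    sample = "url_previous:'/book/232488/92437489.html',url_next:'/book/232488/92437491.html',"
    print(find_next_url(sample))
```

Returning `None` when nothing matches gives the caller a clean way to stop, instead of slicing garbage out of the page.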
  4. Save to a file
with open(file_name, 'w', encoding="UTF-8") as f:
    f.write(title + "\n\n")
    for content in content_list:
        f.write(" " + content + "\n\n")
# No explicit f.close() is needed; the with block closes the file
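Chapter titles can contain characters that are not allowed in file names, and the output folder may not exist yet. The sketch below handles both; `safe_filename` and `save_chapter` are hypothetical names, and the set of replaced characters is the usual list Windows rejects, which is my assumption about what needs stripping.

```python
import re
from pathlib import Path


def safe_filename(title, replacement="_"):
    """Replace characters that are invalid in file names on common systems."""
    cleaned = re.sub(r'[\\/:*?"<>|]', replacement, title).strip()
    return cleaned or "untitled"


def save_chapter(out_dir, title, paragraphs):
    """Write one chapter to out_dir/<title>.txt and return the path."""
    out_path = Path(out_dir)
    out_path.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    target = out_path / (safe_filename(title) + ".txt")
    with target.open("w", encoding="utf-8") as f:
        f.write(title + "\n\n")
        for paragraph in paragraphs:
            f.write(" " + paragraph + "\n\n")
    return target
```

A call such as `save_chapter("./content", my_title, my_contents)` would slot in where the file is written.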

Here is the complete code

import urllib.request as ur
import os
import time


def url_open(url):
    # Pause briefly between requests to go easy on the server
    time.sleep(0.1)
    req = ur.Request(url)
    req.add_header("User-Agent", "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
                                 "Chrome/80.0.3987.116 Safari/537.36")

    response = ur.urlopen(req)
    html = response.read()

    return html


def find_next(html):
    # Pull the next chapter's path out of the url_next field
    a = html.find("url_next:")
    b = html.find(",", a, a + 128)
    return html[a + 10: b - 1]


def find_title(html):
    # The title is the <h1> text that follows the mlfy_main_text marker
    a = html.find("mlfy_main_text")
    b = html.find("</h1>", a, a + 255)
    title = html[a + 24: b]
    return title


def find_content(html):
    content = []
    start = html.find("toolbarm();")
    end = html.find("铅笔小说", start)

    a = html.find("<p>", start, end)
    # Stop when no more <p> is found or once we pass the ad line
    while a != -1 and a < end:
        b = html.find("</p>", a)
        if b != -1 and b < end:
            content.append(html[a + 3: b])
        else:
            break
        a = html.find("<p>", b + 4)
    return content


def save_file(file_name, title, content_list):
    with open(file_name, 'w', encoding="UTF-8") as f:
        f.write(title + "\n\n")
        for content in content_list:
            f.write(" " + content + "\n\n")


if __name__ == "__main__":
    # html = url_open("https://www.23qb.net/book/232488/90659053.html").decode("gbk")
    os.makedirs("./content", exist_ok=True)  # make sure the output folder exists
    url_next = "/book/232488/92437490.html"
    while url_next != "/book/232488/92437636.html":
        html = url_open("https://www.23qb.net" + url_next).decode("gbk")
        # Find the title
        my_title = find_title(html)
        print(my_title)
        # Find the content
        my_contents = find_content(html)
        # Save as txt
        save_file("./content/" + str(my_title).replace('/', '') + ".txt", my_title, my_contents)
        print("Saved")
        # Find the next URL
        url_next = find_next(html)
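The main loop only stops when it reaches one hard-coded URL, so a missing or repeated url_next would keep it running. Here is a minimal sketch of a loop with a few extra exits. The function `crawl`, its parameters, and the `max_chapters` cap are all my own assumptions; `fetch` and `find_next` are passed in so the loop can be tried offline with fake pages.

```python
def crawl(start_path, stop_path, fetch, find_next, max_chapters=500):
    """Follow next-chapter links and return the list of paths visited.

    Stops at stop_path (not fetched, like the loop above), on an empty or
    repeated path, or after max_chapters pages.
    """
    visited = []
    seen = set()
    path = start_path
    while path and path != stop_path and path not in seen and len(visited) < max_chapters:
        seen.add(path)
        html = fetch(path)        # in the real script: download and decode the page
        visited.append(path)      # in the real script: parse and save the chapter here
        path = find_next(html)
    return visited


if __name__ == "__main__":
    # Fake "site": each page's content is simply the path of the next page
    pages = {"/a": "/b", "/b": "/c", "/c": "/a"}
    print(crawl("/a", "/z", pages.get, lambda html: html))
```

With the fake pages above the loop ends once it is pointed back at `/a`, which it has already seen.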

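Output

Run example

Next up: text to speech

Here is an open-source tool with Microsoft's text-to-speech interface built in: official site, repository link

Example run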


The options on the right can be configured however you like.

The end