Scraping Website Email Addresses with Python

Tengpaz

Web Crawlers: Scraping Email Addresses

Step 1: Setting Up the Development Environment

Windows

Python Installation

Check whether your system already has Python installed

Search for cmd in the Start menu to open a Command Prompt, or hold Shift, right-click the desktop, and choose Open in Terminal.

In the terminal window, type python and press Enter.

  • If the Python prompt >>> appears, Python is installed on your system
  • If you see an error message saying that python is not a recognized command, Python is not installed yet
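
For reference, a successful check looks roughly like this (the version and build details will differ on your machine):

    C:\> python
    Python 3.12.0 (...) on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> exit()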
Installing Python

Some Windows systems do not come with Python preinstalled (mine didn't), so you may need to install it yourself.

Here is the link to the official Python website: https://www.python.org/

I am using Windows 11, so I don't have any particular advice for Windows 7 and earlier; if you are on one of those versions, please refer to the guidance on the official site.

[Screenshot: the Python official website]

Simply click Python 3.12.0 to download the latest release; if this article is out of date and the version number no longer matches, just download whatever latest version the official site offers.

When running the installer, be sure to check Add Python to PATH; otherwise configuring your environment will be much more troublesome.
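
After installation completes, you can confirm that PATH was set correctly by opening a new Command Prompt and checking the version (the exact version string depends on the release you installed):

    C:\> python --version
    Python 3.12.0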

Linux

Python Installation

Linux systems are designed with programming in mind, and the vast majority of Linux distributions come with Python preinstalled, so you generally don't need to install any software or change any settings.
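
You can verify this from a terminal. On most modern distributions the interpreter is installed as python3; the version shown below is only an example:

    $ python3 --version
    Python 3.10.12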

Code

import sys
import getopt
import requests
from bs4 import BeautifulSoup
import re

# Main entry point: parse the command-line arguments supplied by the user
def start(argv):
    url = ""
    pages = ""
    if len(sys.argv) < 2:
        print("-h for help\n")
        sys.exit()
    # Exception handling for argument parsing
    try:
        opts, args = getopt.getopt(argv, "-u:-p:-h")
    except getopt.GetoptError:
        print('Error: invalid argument!')
        sys.exit()
    for opt, arg in opts:
        if opt == "-u":
            url = arg
        elif opt == "-p":
            pages = arg
        elif opt == "-h":
            usage()  # usage() prints the help text and exits

    launcher(url, pages)

# Usage instructions
def usage():
    print('-h: --help  show this help;')
    print('-u: --url   target domain;')
    print('-p: --pages number of search result pages to crawl;')
    # "email_crawler.py" is a placeholder for whatever you named this script
    print('eg: python email_crawler.py -u "www.baidu.com" -p 100' + '\n')
    sys.exit()

# Launcher: calls bing_search() and baidu_search(), merges the results
# from both engines, and de-duplicates them
def launcher(url, pages):
    email_num = []
    key_words = ['email', 'mail', 'mailbox', '邮件', '邮箱', 'postbox']
    for page in range(1, int(pages) + 1):
        for key_word in key_words:
            bing_emails = bing_search(url, page, key_word)
            baidu_emails = baidu_search(url, page, key_word)
            sum_emails = bing_emails + baidu_emails
            for email in sum_emails:
                if email in email_num:
                    pass
                else:
                    print(email)
                    with open('data.txt', 'a+') as f:
                        f.write(email + '\n')
                    email_num.append(email)

# Bing search: works around Bing's anti-crawling checks (Referer and cookie validation)
def bing_search(url, page, key_word):
    referer = "http://cn.bing.com/search?q=email+site%3abaidu.com&qs=n&sp=-1&pq=emailsite%3abaidu.com&first=1&FORM=PERE1"
    conn = requests.session()
    bing_url = "https://cn.bing.com/search?q=" + key_word + "+site%3a" + url + "&qs=n&sp=-1&pq=" + key_word + "+site%3a" + url + "&first=" + str((page - 1) * 10) + "&FORM=PERE1"
    conn.get('http://cn.bing.com', headers=headers(referer))  # warm up the session to obtain cookies
    r = conn.get(bing_url, stream=True, headers=headers(referer), timeout=8)
    emails = search_email(r.text)
    return emails

# Baidu search: works around Baidu's anti-crawling (results are JS redirect links,
# so each result URL has to be followed before the real page can be scraped)
def baidu_search(url, page, key_word):
    email_list = []
    referer = "https://www.baidu.com/s?wd=email+site%3Abaidu.com&pn=1"
    baidu_url = "https://www.baidu.com/s?wd=" + key_word + "+site%3A" + url + "&pn=" + str((page - 1) * 10)
    conn = requests.session()
    conn.get(referer, headers=headers(referer))
    r = conn.get(baidu_url, headers=headers(referer))
    soup = BeautifulSoup(r.text, 'lxml')
    tagh3 = soup.find_all('h3')
    for h3 in tagh3:
        try:
            href = h3.find('a').get('href')
            r = requests.get(href, headers=headers(referer), timeout=8)
            emails = search_email(r.text)
        except Exception:
            continue  # skip results that fail to resolve or load
        for email in emails:
            email_list.append(email)
    return email_list

# Extract email addresses from HTML with a regular expression
def search_email(html):
    emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", html, re.I)
    return emails

# Build request headers carrying the given Referer
def headers(referer):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
               'Accept': '*/*',
               'Accept-Language': 'en-US,en;q=0.5',
               'Accept-Encoding': 'gzip,deflate',
               'Referer': referer}
    return headers

if __name__ == '__main__':
    # Catch Ctrl-C so the crawler can be stopped cleanly
    try:
        start(sys.argv[1:])
    except KeyboardInterrupt:
        print("interrupted by user, killing all threads...")

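As a quick sanity check of the extraction step, here is the regular expression from search_email() run on a plain string by itself; the two sample addresses are made up for illustration:

    import re

    # Two made-up addresses embedded in ordinary text
    text = 'Contact us: alice@example.com or bob.smith@mail.example.org'

    emails = re.findall(r"[a-z0-9\.\-+_]+@[a-z0-9\.\-+_]+\.[a-z]+", text, re.I)
    print(emails)  # ['alice@example.com', 'bob.smith@mail.example.org']
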
This article is adapted from https://blog.csdn.net/qq_41046513/article/details/121402824 with some minor adjustments; it is not original work.

  • Title: Scraping Website Email Addresses with Python
  • Author: Tengpaz
  • Created: 2023-10-08 11:41:52
  • Updated: 2024-10-07 11:11:56
  • Link: https://qinaida.cn/tengpaz/2023/10/08/Python实现网站邮箱爬取/
  • Copyright: This article is licensed under CC BY-NC-SA 4.0.