Python crawler: logging in to Discuz! automatically to farm points


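I have been reading forums a lot lately and wanted to raise my level on one of them, so I decided to write a script that farms points automatically every day. Below we build, from scratch, a Python script that logs in and visits user spaces automatically. We use https://www.hostloc.com/ as the test site.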

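Requirements

We need a Python 3 environment and the pip package manager. The whole script depends on two third-party packages, urllib3 and BeautifulSoup4. The code below also passes "lxml" to BeautifulSoup as the parser name, which requires the lxml package to be installed as well.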


# If pip is not on your PATH
meshell@python# python3 -m pip install urllib3 beautifulsoup4 lxml

# If pip is on your PATH
meshell@python# pip install urllib3 beautifulsoup4 lxml

Basic definitions

We define a simple class with the attributes username, password, userAgent, host, identity_cookie_name, and cookies.
username, password, host, and identity_cookie_name are passed in as constructor arguments.


class HostLoc(object):
    username = None  # login user name
    password = None  # login password
    userAgent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:67.0) Gecko/20100101 Firefox/67.0"  # User-Agent header sent with every request
    identity_cookie_name = None  # name of the cookie that marks a successful login
    host = None  # site host, e.g. "https://www.hostloc.com" (no trailing slash)

    def __init__(self, username, password, host, identity_cookie_name):
        self.username = username
        self.password = password
        self.host = host
        self.identity_cookie_name = identity_cookie_name
        # Cookies from all responses as a name -> value dict.
        # Created per instance so that two objects never share one dict.
        self.cookies = {}

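Implementing login

Before implementing login, it helps to know the basics of urllib3, the library we use to send HTTP requests (see its official documentation). We give the class one shared method for sending requests, because every request goes to the same site and the cookie and header values are basically the same each time.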


import ssl
import urllib3

...
host = None

def __init__(self, username, password, host, identity_cookie_name):
    ...
    # One connection pool for all requests; certificate verification is disabled here
    self.http = urllib3.PoolManager(cert_reqs=ssl.CERT_NONE, assert_hostname=False)
    ...

def _request(self, method, url, fields=None):
    headers = {
        "origin": self.host,
        "referer": self.referer,
        "User-Agent": self.userAgent,
    }
    # Send back every cookie collected so far
    if len(self.cookies) > 0:
        headers['cookie'] = self.joinCookies()

    # Keyword arguments keep this call valid across urllib3 versions
    response = self.http.request(method, url, fields=fields, headers=headers)

    # Remember any cookies the server set in this response
    cookies = self.parseCookie(response.headers.get('Set-Cookie'))
    if len(cookies) > 0:
        self.cookies.update(cookies)

    return response
...

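Note: if you want to turn off urllib3's HTTPS certificate verification, you must set both cert_reqs=ssl.CERT_NONE and assert_hostname=False.

urllib3 headers are plain strings, while our cookies are stored as a map, so we need a method that converts the map into the form of a cookie header.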


def joinCookies(self):
    cookie_string = ""
    for key, value in self.cookies.items():
        cookie_string += key + "=" + value + ";"
    return cookie_string.rstrip(";")
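
As a standalone illustration of the same conversion, here is a minimal sketch that builds the header value with str.join. The function name join_cookies and the sample cookie values are made up for this example. It separates the pairs with "; ", the form browsers conventionally send.

```python
def join_cookies(cookies):
    """Build a Cookie request header value from a name -> value mapping."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


# Made-up cookie values, only for demonstration
print(join_cookies({"sid": "abc123", "lang": "en"}))  # sid=abc123; lang=en
```

An empty dict yields an empty string, so the caller can still decide whether to send the header at all.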

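Parsing cookies is the most important step in the whole request flow. When login succeeds we have to record every cookie the server sends, because those cookies must be sent back to the server when the next page is requested. The request method above parses them with a parseCookie method; let's look at how it is implemented.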


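def parseCookie(self, cookie=None):
    cookies = {}
    if cookie is None:
        return cookies
    for value in cookie.split(";"):
        parts = value.split("=", 1)  # split once so "=" inside a value is kept
        # Skip flags without a value and the attributes we do not need
        if len(parts) < 2 or parts[0].strip().lower() in ["expires", "max-age", "path"]:
            continue
        name = parts[0].strip()
        # Several Set-Cookie headers arrive joined by ","; keep only the name after it
        index = name.find(',')
        if index != -1:
            name = name[index + 1:].lstrip(" ")

        cookies[name] = parts[1]
    return cookies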

The code above extracts only each cookie's name and value; the expiry time, path, and similar attributes are not needed.
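
Splitting on ";" by hand gets fragile once several Set-Cookie headers are folded into one string. As an alternative sketch (not what the code in this article does), the standard library's http.cookies.SimpleCookie can parse one Set-Cookie value at a time. The helper name parse_set_cookie_headers and the header values are invented for illustration. The sketch assumes the headers are available as a list of separate strings, which urllib3 should expose through response.headers.getlist("Set-Cookie").

```python
from http.cookies import SimpleCookie


def parse_set_cookie_headers(header_values):
    """Return {name: value} for a list of individual Set-Cookie header values."""
    cookies = {}
    for header_value in header_values:
        jar = SimpleCookie()
        # Attributes such as path and expires end up as morsel metadata,
        # so only real cookie names appear as keys.
        jar.load(header_value)
        for name, morsel in jar.items():
            cookies[name] = morsel.value
    return cookies


# Made-up header values, one string per Set-Cookie header
sample_headers = [
    "sid=abc123; expires=Wed, 21 Oct 2026 07:28:00 GMT; Max-Age=3600; path=/; HttpOnly",
    "lang=en; path=/",
]
print(parse_set_cookie_headers(sample_headers))
```

Because each header is parsed on its own, the comma inside the expires date never gets confused with a separator between cookies.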

With the code we already have, login only needs to send the request and check whether the cookie that marks a successful login is present.

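...

loginFromUrl = "/member.php?mod=logging&action=login&loginsubmit=yes&infloat=yes&lssubmit=yes&inajax=1"

def login(self):
    # _request needs a full URL, so loginFromUrl is assumed to have been
    # prefixed with self.host in __init__, like referer and creditUrl
    response = self._request("post", self.loginFromUrl, fields={
        "fastloginfield": "username",
        "username": self.username,
        "password": self.password,
        "quickforward": "yes",
        "handlekey": "ls"
    })

    if response.status == 400:
        print("Restricted by the server")
        return False

    # The identity cookie is only present after a successful login
    if self.identity_cookie_name in self.cookies:
        print("Login succeeded")
        return True
    return False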

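User information

After a successful login we can fetch the current user's points, prestige, money, and other details. We use the BeautifulSoup4 library to match HTML elements on the page. It retrieves DOM elements much like jQuery does in JavaScript, and even the API is similar; see the official documentation for basic usage. We define a method that prints the current user's information, and pick an arbitrary topic page as the entry point.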


from bs4 import BeautifulSoup

referer = "/forum-45-1.html"

creditUrl = "/home.php?mod=spacecp&ac=credit&showcredit=1&inajax=1&ajaxtarget=extcreditmenu_menu"

def __init__(self, username, password, host, identity_cookie_name):
    ...
    # Turn the relative paths into full URLs
    self.referer = self.host + self.referer
    self.creditUrl = self.host + self.creditUrl
    ...

def info(self):
    response = self._request(
        "post",
        self.referer
    )
    bs = BeautifulSoup(response.data, "lxml")

    score = bs.find("a", id="extcreditmenu").string  # points summary
    name = bs.find("strong", class_="vwmy").string  # user name

    menu_response = self._request("get", self.creditUrl)
    menu_response_bs = BeautifulSoup(menu_response.data, "lxml")

    hcredit_1 = menu_response_bs.find("span", id="hcredit_1").string  # prestige
    hcredit_2 = menu_response_bs.find("span", id="hcredit_2").string  # money

    print("Nickname: %s, %s\n Prestige: %s, Money: %s" % (name, score, hcredit_1, hcredit_2))

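The information printed by this method is all extracted with BeautifulSoup4, so we do not have to write regular expressions ourselves. We only use find to fetch the specified content, and this function returns only the first matching element.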

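Points

Discuz! offers many ways to earn points: visiting other users' profile pages, posting threads, replying to threads, logging in daily, and so on. We implement only the simplest one, visiting other users' profile pages, which adds points automatically.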


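import random

visit_loop = 0  # number of profile pages visited so far

def visitProfile(self, url):
    response = self._request("get", url)
    print(url + "\n")
    bs = BeautifulSoup(response.data, "lxml")
    self.visit_loop += 1

    avatars = bs.find_all("a", class_="avt")  # avatar links on the profile page

    visit_len = len(avatars)
    print(visit_len)
    # randint(2, visit_len - 1) needs at least three links; stop after 20 visits
    if visit_len > 2 and self.visit_loop < 20:
        index = random.randint(2, visit_len - 1)
        self.visitProfile(self.host + "/" + avatars[index]['href'])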
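We call this method from the info method, using the first user on the info page as the starting point. From there, each next profile is picked at random from that user's recent visitors, with an upper limit of 20 visits.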

def info(self):
    ...
    first = self.host + "/" + bs.find("a", class_="notabs")["href"]
    print("Nickname: %s, %s\n Prestige: %s, Money: %s" % (name, score, hcredit_1, hcredit_2))
    self.visitProfile(first)
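
The methods above build full URLs by concatenating the host, "/", and an href. If an href ever starts with "/" or is already absolute, plain concatenation gives a malformed URL. A minimal sketch of a more forgiving approach uses urllib.parse.urljoin from the standard library; the helper name absolute_url and the space-uid-1.html href are assumptions made for this example.

```python
from urllib.parse import urljoin


def absolute_url(host, href):
    """Resolve an href taken from a page against the site host."""
    # Normalise to exactly one trailing slash so relative hrefs resolve at the root
    return urljoin(host.rstrip("/") + "/", href)


host = "https://www.hostloc.com"
print(absolute_url(host, "/forum-45-1.html"))
print(absolute_url(host, "space-uid-1.html"))  # hypothetical relative href
```

Both calls produce a URL directly under https://www.hostloc.com/, and an href that is already a full URL is returned unchanged.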

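At this point the basic flow works end to end. Next we write the collected cookies to a file so that the next run does not have to log in again. We do the write when the object is destroyed, and read the cookies back in the constructor.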


import pathlib

def __init__(self, username, password, host, identity_cookie_name):
    ...
    # Read the saved cookies first: opening the file with 'w' truncates it
    self.origin_cookies = self.readCookie()
    self.open_file = open('./cookie.txt', 'w')
    if self.origin_cookies:
        for value in self.origin_cookies.split(";"):
            parts = value.split("=", 1)
            if len(parts) == 2:
                self.cookies[parts[0]] = parts[1]

def writeCookie(self, cookie):
    self.open_file.write(cookie)
    self.open_file.close()

def readCookie(self):
    if not pathlib.Path("./cookie.txt").exists():
        return None
    with open('./cookie.txt', 'r') as f:
        return f.read()

def __del__(self):
    # Persist the cookies when the object is destroyed
    if len(self.cookies) > 0:
        self.writeCookie(self.joinCookies())

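You may wonder why the file is opened in the constructor rather than in writeCookie. The reason is that __del__ may run while the interpreter is shutting down, and at that point the built-in open can no longer be relied on. With that, the whole script is complete.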
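
If relying on __del__ feels too fragile, one alternative is to save the cookies explicitly at the end of a run. The sketch below is only an illustration under that assumption: load_cookies, save_cookies, and the cookie.json file name are hypothetical, and it stores the dict as JSON instead of the semicolon-joined string used above.

```python
import json
import os
import tempfile
from pathlib import Path


def load_cookies(path):
    """Return the saved name -> value dict, or an empty dict if nothing is saved."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def save_cookies(path, cookies):
    """Write the cookie dict as JSON; the file is closed when the block exits."""
    with Path(path).open("w", encoding="utf-8") as f:
        json.dump(cookies, f)


# Round trip in a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp:
    cookie_file = os.path.join(tmp, "cookie.json")
    print(load_cookies(cookie_file))   # nothing saved yet
    save_cookies(cookie_file, {"sid": "abc123"})
    print(load_cookies(cookie_file))   # the dict that was just saved
```

Since the file is opened only at the moment of writing, a crash earlier in the run leaves the previously saved cookies untouched.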
