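Python Utilities

Packaging with PyInstaller

In PyCharm, create a new project with a new virtual environment.

Once the project opens, it contains a venv folder that holds the libraries installed for that environment.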
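To confirm that the interpreter really is running inside the virtual environment, one option is to compare sys.prefix with sys.base_prefix. This is a minimal sketch; the helper name in_virtualenv is made up for illustration, and very old virtualenv releases may behave differently.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the running interpreter belongs to a virtual environment."""
    # Inside a venv, sys.prefix points at the venv folder while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("virtual environment active:", in_virtualenv())
print("interpreter prefix:", sys.prefix)
```

If the first line prints False inside PyCharm's terminal, the terminal is probably using the system interpreter rather than the project's venv.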

Create requirements.txt

pip install requests

pip freeze > requirements.txt

requirements.txt now lists the packages installed in the virtual environment.

Install PyInstaller

pip install pyinstaller

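After installation finishes, close the terminal and reopen it.

Run pyinstaller with no arguments; if it prints its usage message, the installation succeeded.

Packaging a single script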

pyinstaller -D app.py

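Packaging starts; the -D (--onedir) option builds a one-folder bundle.

When it finishes, three items appear in the current folder:

build: intermediate files produced during the build
dist: the directory containing the packaged output
xxx.spec: the packaging configuration file

dist contains a folder named app, after the script packaged above. Send the whole app folder to other people; they can run the program by opening app.exe inside it.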
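Because the app folder is what gets shipped, a packaged program often needs to know where it is running from, for example to find a config file placed next to app.exe. PyInstaller sets sys.frozen in the packaged program, which makes the check below possible. This is only a sketch; the helper names is_frozen and app_dir are hypothetical.

```python
import sys
from pathlib import Path


def is_frozen() -> bool:
    """Return True when running from a PyInstaller build."""
    # PyInstaller sets sys.frozen in the packaged program; plain Python does not.
    return bool(getattr(sys, "frozen", False))


def app_dir() -> Path:
    """Return the folder that contains the program being run."""
    if is_frozen():
        # In a packaged build, sys.executable is the app.exe file itself.
        return Path(sys.executable).resolve().parent
    # As a normal script, sys.argv[0] is the path used to start the script.
    return Path(sys.argv[0]).resolve().parent


print("frozen:", is_frozen())
print("application folder:", app_dir())
```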

File reading and writing

with open('xxx.txt', 'w', encoding='utf-8') as f:
    f.write('xxx' + '\n')  # the with block closes the file automatically
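Reading works the same way with mode 'r'. The sketch below writes a few lines and reads them back; the helper names and the demo.txt file name are placeholders, and a temporary folder is used so nothing is left on disk.

```python
import tempfile
from pathlib import Path


def write_lines(path, lines):
    """Write each string in lines to path, one per line, as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


def read_lines(path):
    """Read path back and return its lines without the trailing newlines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f]


# Use a temporary folder so the demo leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo.txt"  # placeholder file name
    write_lines(demo_file, ["first line", "café"])
    restored = read_lines(demo_file)
    print(len(restored), "lines read back")
```

Passing encoding='utf-8' on both sides avoids depending on the platform's default encoding, which differs between Windows and other systems.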

Web scraping

Code example

from bs4 import BeautifulSoup
import requests


def getHTMLText(url):
    """Download url and return its text, or None if the request fails."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
        res = requests.get(url, headers=headers, timeout=10)
        res.raise_for_status()
        res.encoding = res.apparent_encoding  # guess the encoding from the content
        return res.text
    except requests.RequestException:
        return None


def get_info(url):
    demo = getHTMLText(url)
    if demo is None:
        print('request failed:', url)
        return
    soup = BeautifulSoup(demo, 'html.parser')
    table = soup.find('table', {'class': 'rk-table'})
    if table is None:  # the page layout changed or the table was not served
        print('ranking table not found')
        return
    rows = table.find('tbody').find_all('tr')
    for row in rows:
        tds = row.find_all('td')
        rank = tds[0].text.strip()
        name = row.find('a').text.strip()
        location = tds[2].text.strip()
        total = tds[4].text.strip()
        data = {
            'rank': rank,
            'name': name,
            'location': location,
            'total': total
        }
        print(data)


url_2020 = 'https://www.shanghairanking.cn/rankings/bcur/2020'
url_2021 = 'https://www.shanghairanking.cn/rankings/bcur/2021'
get_info(url_2021)
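get_info only prints each row. If the rows should be kept, one option is to collect one dict per row and write them out as CSV with the standard csv module. The sketch below uses English field names and invented placeholder rows, not real ranking data; rows_to_csv is a hypothetical helper.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dicts to CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder rows, one dict per table row; the values are invented.
sample_rows = [
    {"rank": "1", "name": "University A", "location": "City X", "total": "100.0"},
    {"rank": "2", "name": "University B", "location": "City Y", "total": "99.5"},
]

csv_text = rows_to_csv(sample_rows, ["rank", "name", "location", "total"])
print(csv_text)
```

To write to disk instead, the same DictWriter can be given a file opened with newline='' and encoding='utf-8'.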

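The Scrapy crawling framework

Introduction

Basic functionality
Scrapy is an application framework for crawling websites and extracting structured data. It is used in a wide range of programs, including data mining, information processing, and archiving historical data. With Scrapy it is usually straightforward to build a spider that fetches the content or images of a given site.
Architecture
Scrapy Engine: handles communication among the Spider, Item Pipeline, Downloader, and Scheduler, passing signals and data between them.
Scheduler: receives the Requests sent by the engine, orders and queues them, and hands them back to the engine when the engine asks for them.
Downloader: downloads every Request sent by the engine and returns the resulting Responses to the engine, which passes them on to the Spider.
Spider: processes the Responses, extracts the data needed for the Item fields, and submits the URLs to follow to the engine, which sends them to the Scheduler again.
Item Pipeline: post-processes the Items produced by the Spider (detailed analysis, filtering, storage, and so on).
Downloader Middlewares: components for customizing and extending the download step.
Spider Middlewares: components for customizing and extending the communication between the engine and the Spider.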
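The data flow between these components can be illustrated with a toy model in plain Python. This is not how Scrapy is implemented (Scrapy is asynchronous and far more elaborate); every name below, including FAKE_SITE, is invented purely to show who hands what to whom.

```python
from collections import deque


class Scheduler:
    """Queues pending requests and skips URLs that were already seen."""

    def __init__(self):
        self.queue = deque()
        self.seen = set()

    def enqueue(self, url):
        if url not in self.seen:
            self.seen.add(url)
            self.queue.append(url)

    def next_request(self):
        return self.queue.popleft() if self.queue else None


# Stand-in for the network: URL -> (page title, links found on the page).
FAKE_SITE = {
    "/": ("home", ["/a", "/b"]),
    "/a": ("page a", ["/b"]),
    "/b": ("page b", ["/"]),
}


def downloader(url):
    """Pretend to download a page; returns the fake response for url."""
    return FAKE_SITE[url]


def spider(url, response):
    """Yield one item per page, then the links that should be followed."""
    title, links = response
    yield {"url": url, "title": title}
    for link in links:
        yield link


def pipeline(item, storage):
    """Post-process an item; here it is simply stored in a list."""
    storage.append(item)


def engine(start_url):
    """Move data between the components until the scheduler is empty."""
    scheduler, storage = Scheduler(), []
    scheduler.enqueue(start_url)
    while (url := scheduler.next_request()) is not None:
        response = downloader(url)
        for result in spider(url, response):
            if isinstance(result, dict):
                pipeline(result, storage)   # items go to the pipeline
            else:
                scheduler.enqueue(result)   # new URLs go back to the scheduler
    return storage


items = engine("/")
print([item["title"] for item in items])
```

Each page is visited once because the scheduler remembers the URLs it has already queued, which is also why the circular links in FAKE_SITE do not cause an endless loop.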


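Structure of a Scrapy project

project name
    project name (a package with the same name)
        spiders folder (holds the spider files)
            __init__.py
            custom spider file      the core logic
        __init__.py
        items.py        defines the data structure: which fields the scraped data contains
        middlewares.py  middlewares, e.g. proxies
        pipelines.py    pipelines, which process the scraped data
        settings.py     configuration: robots.txt compliance, User-Agent, and so on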
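As an illustration of what might go into pipelines.py: a pipeline is an ordinary class with a process_item method, optionally with open_spider and close_spider hooks. The sketch below writes each item as one JSON line. The class name JsonLinesPipeline and the items.jl path are hypothetical, and the last lines drive the class by hand, outside Scrapy, only to show the call order.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Writes every item it receives as one JSON object per line."""

    def __init__(self, path="items.jl"):  # hypothetical output path
        self.path = path
        self.file = None

    def open_spider(self, spider):
        # Called once when the spider starts.
        self.file = open(self.path, "w", encoding="utf-8")

    def close_spider(self, spider):
        # Called once when the spider finishes.
        self.file.close()

    def process_item(self, item, spider):
        # Called for every item; return the item so later pipelines receive it.
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item


# Drive the pipeline by hand, outside Scrapy, to show the call order.
demo_path = os.path.join(tempfile.mkdtemp(), "items.jl")
pipeline = JsonLinesPipeline(demo_path)
pipeline.open_spider(spider=None)
pipeline.process_item({"link": "https://example.com/1.jpg"}, spider=None)
pipeline.close_spider(spider=None)

with open(demo_path, encoding="utf-8") as f:
    written = f.read()
print(written)
```

In a real project the class would also have to be listed under ITEM_PIPELINES in settings.py before Scrapy calls it.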

Environment setup

pip install scrapy

Creating a project

scrapy startproject <project name>   (img in this example)

Output

New Scrapy project 'img', using template directory 'D:\Environment\Anaconda3\Lib\site-packages\scrapy\templates\project', created in:
    C:\Users\Admin\Desktop\scrapy\img

You can start your first spider with:
    cd img
    scrapy genspider example example.com

Change into the project and generate the spider file

cd img
scrapy genspider unsplash unsplash.com

Output

Created spider 'unsplash' using template 'basic' in module:
  img.spiders.unsplash

Run

scrapy crawl unsplash   (unsplash is the spider's name, which here matches the file name)

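Crawling Unsplash with Scrapy

1. Analyze the site

The high-resolution photo site https://unsplash.com/ offers more than 70,000 high-resolution images. While you browse, the image URLs are returned by an API.
Chrome has an Unsplash extension; find the relevant JS in the extension's files, then find the API address in it.

Locate the extension's directory among Chrome's extension folders by the time the extension was installed.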
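The spider in the next step treats the API response as a JSON list in which each photo has a urls object with a raw entry. Extracting the links from such a response could look like the sketch below; the payload is invented for illustration (real responses carry many more fields) and extract_raw_links is a hypothetical helper.

```python
import json

# Invented payload in the shape the spider relies on: a list of photos,
# each with a "urls" object.
sample_body = """
[
  {"id": "photo-1", "urls": {"raw": "https://example.com/photo-1?raw", "small": "https://example.com/photo-1?small"}},
  {"id": "photo-2", "urls": {"raw": "https://example.com/photo-2?raw"}},
  {"id": "photo-3", "urls": {}}
]
"""


def extract_raw_links(body):
    """Return the raw URL of every photo that has one."""
    photos = json.loads(body)
    return [p["urls"]["raw"] for p in photos if "raw" in p.get("urls", {})]


links = extract_raw_links(sample_body)
print(links)
```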

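2. Crawl the image URLs

Install Scrapy: pip install Scrapy
Create the project inside the working directory: scrapy startproject scrapy_unsplash E:\wendi1\pyplace\scrapy_unsplash
Enter the project directory and create the spider: scrapy genspider unsplash api.unsplash.com --template=crawl
- This generates unsplash.py in the ./scrapy_unsplash/spiders directory.
- Configure settings.py: concurrent requests CONCURRENT_REQUESTS = 100, download delay DOWNLOAD_DELAY = 1.6
First crawl the site, then store the raw-variant image URLs in SQLite. The code of unsplash.py: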

# -*- coding: utf-8 -*-
import json
import sqlite3
import scrapy
from scrapy.spiders import CrawlSpider

# SQLite is a lightweight database that needs no server or port; enough here
DB_PATH = "E:\\wendi1\\pyplace\\scrapy_unsplash\\database\\link.db"


class UnsplashSpider(CrawlSpider):
    name = 'unsplash'
    allowed_domains = ['api.unsplash.com']

    def start_requests(self):
        createDB()  # create the database

        start, page = 1, 2000  # first page and last page to crawl
        for i in range(start, page + 1):  # start from the first page
            url = 'https://api.unsplash.com/photos/?client_id=fa60305aa82e74134cabc7093ef54c8e2c370c47e73152f72371c828daedfcd7&page=' + str(
                i) + '&per_page=30'
            yield scrapy.Request(url=url, callback=self.parse_item)

    def parse_item(self, response):
        js = json.loads(response.text)  # decode the response body as JSON
        links = [(j["urls"]["raw"],) for j in js]

        conn = sqlite3.connect(DB_PATH)  # connect to the database
        # Scrapy runs callbacks in a single thread, so no extra locking is needed.
        with conn:  # commits on success, rolls back on error
            # parameterized query: the link is never spliced into the SQL text
            conn.executemany("INSERT INTO LINK(LINK) VALUES (?);", links)
        conn.close()


def createDB():  # create the database
    conn = sqlite3.connect(DB_PATH)
    conn.execute("DROP TABLE IF EXISTS LINK;")  # drop the old table on every rerun
    conn.execute("CREATE TABLE LINK ("  # ID: auto-increment primary key; LINK: image URL
                 "ID INTEGER PRIMARY KEY AUTOINCREMENT,"
                 "LINK VARCHAR(255));")
    conn.commit()
    conn.close()

Run scrapy crawl unsplash. The total number of pages to crawl is set by the page variable in start_requests.

The URLs are now stored in SQLite.
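To check what was stored, the LINK table can be queried with the same sqlite3 module. The sketch below uses an in-memory database and two invented links so that it runs on its own; against the real data, the connect call would receive the path of link.db instead.

```python
import sqlite3

# An in-memory database stands in for link.db so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE LINK ("
    "ID INTEGER PRIMARY KEY AUTOINCREMENT,"
    "LINK VARCHAR(255));"
)

# Invented example links; the ? placeholder lets sqlite3 handle the quoting.
sample_links = ["https://example.com/a.jpg", "https://example.com/b.jpg"]
with conn:  # commits when the block exits without an error
    conn.executemany("INSERT INTO LINK(LINK) VALUES (?);", [(u,) for u in sample_links])

total = conn.execute("SELECT COUNT(*) FROM LINK;").fetchone()[0]
rows = conn.execute("SELECT ID, LINK FROM LINK ORDER BY ID LIMIT 5;").fetchall()
print(total, rows)
conn.close()
```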

Database management

import pymysql

conn = pymysql.connect(
    host='127.0.0.1',   # server address; 127.0.0.1 for a local server
    user='root',        # user name
    password='123456',  # password
    port=3306,          # port, 3306 by default
    database='python',  # database name
    charset='utf8',     # character encoding
)
cur = conn.cursor()  # create a cursor object
if cur:
    print('Database connection succeeded')
sql = "select * from `web` "  # SQL statement
cur.execute(sql)  # execute the SQL statement
data = cur.fetchall()  # fetch all rows with fetchall
for i in data[:1]:  # print one row
    print(i)
cur.close()  # close the cursor
conn.close()  # close the connection

Scraper example (writing to a database)

Example

import json
import re
import pymysql
import requests

conn = pymysql.connect(
    host='127.0.0.1',   # server address; 127.0.0.1 for a local server
    user='root',        # user name
    password='123456',  # password
    port=3306,          # port, 3306 by default
    database='python',  # database name
    charset='utf8',     # character encoding
)
cur = conn.cursor()  # create a cursor object


def getHTMLText(url):
    """Download url and return its text, or None if the request fails."""
    try:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
        res = requests.get(url, headers=headers, timeout=10)
        res.raise_for_status()
        res.encoding = res.apparent_encoding
        return res.text
    except requests.RequestException:
        return None


def main(url):
    res = getHTMLText(url)
    if res is None:
        print('request failed')
        return
    regex = re.compile(r"(?=\()(.*)(?<=\))")  # the (...) part of the JSONP response
    jsonString = regex.findall(res)[-1]
    jsonString = jsonString.strip('()')
    jsonData = json.loads(jsonString)
    diffValue = jsonData['data']['diff']
    # %s placeholders: the driver escapes the values instead of string formatting
    sql = "INSERT INTO money(demo,name) VALUES (%s, %s);"
    for row in diffValue:
        demo = row['f12']
        name = row['f14']
        cur.execute(sql, (demo, name))
        print(demo + name)
    conn.commit()  # commit once, after all rows are inserted
    cur.close()
    conn.close()


url = 'http://85.push2.eastmoney.com/api/qt/clist/get?cb=jQuery1124007516892373587614_1682312504170&pn=1&pz=20&po=1&np=1&ut=bd1d9ddb04089700cf9c27f6f7426281&fltt=2&invt=2&wbp2u=|0|0|0|web&fid=f3&fs=m:0+t:6,m:0+t:80,m:1+t:2,m:1+t:23,m:0+t:81+s:2048&fields=f1,f2,f3,f4,f5,f6,f7,f8,f9,f10,f12,f13,f14,f15,f16,f17,f18,f20,f21,f23,f24,f25,f22,f11,f62,f128,f136,f115,f152&_=1682312504458'
main(url)
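The response here is JSONP: JSON wrapped in a call to the callback named by the cb parameter. The wrapper-stripping step can be isolated and tried without network access, as in the sketch below. The sample response is invented and only mirrors the data, diff, f12 and f14 keys used above; parse_jsonp and JSONP_RE are hypothetical names.

```python
import json
import re

# Matches callbackName( ... ) with an optional trailing semicolon.
JSONP_RE = re.compile(r"^\s*[\w$.]+\s*\((.*)\)\s*;?\s*$", re.DOTALL)


def parse_jsonp(text):
    """Strip the callback wrapper from a JSONP response and decode the JSON."""
    match = JSONP_RE.match(text)
    if match is None:
        raise ValueError("response is not in callback(...) form")
    return json.loads(match.group(1))


# Invented response in the shape used above (data -> diff -> f12 / f14).
sample = 'jQuery123_456({"data": {"diff": [{"f12": "000001", "f14": "Example Co (A)"}]}});'
payload = parse_jsonp(sample)
first = payload["data"]["diff"][0]
print(first["f12"], first["f14"])
```

Anchoring the pattern at both ends keeps parentheses inside string values from being mistaken for the wrapper.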

Converting an image to ASCII art

import numpy as np
from PIL import Image
import time

if __name__ == '__main__':
    start_time = time.time()
    image_file = 'lena.png'
    height = 100  # number of text rows in the output

    img = Image.open(image_file)
    img_width, img_height = img.size
    # 1.8 widens the output, presumably because characters are taller than wide
    width = int(1.8 * height * img_width // img_height)
    # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
    img = img.resize((width, height), Image.LANCZOS)
    pixels = np.array(img.convert('L'))  # grayscale values, 0 to 255
    print('type(pixels) = ', type(pixels))
    print(pixels.shape)
    print(pixels)
    chars = "MNHQ$OC?7>!:-;. "  # dark pixels map to the first characters
    N = len(chars)
    step = 256 // N  # width of each brightness bucket
    print(N)
    result = ''
    for i in range(height):
        for j in range(width):
            result += chars[pixels[i][j] // step]
        result += '\n'
    with open('text.txt', mode='w', encoding='utf-8') as f:
        f.write(result)
    end_time = time.time()
    elapsed_time = end_time - start_time
    print(f'Elapsed time: {elapsed_time:.2f} s')
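The core of the script is the mapping from a brightness value to a character. It can be tried on its own without an image file, as in the sketch below, which reuses the character ramp from the script; the function name to_ascii and the tiny 2x2 input are made up for illustration.

```python
CHARS = "MNHQ$OC?7>!:-;. "  # same ramp as the script: dark first, light last


def to_ascii(pixels, chars=CHARS):
    """Map rows of grayscale values (0-255) to lines of characters."""
    step = 256 // len(chars)  # width of each brightness bucket
    last = len(chars) - 1
    lines = []
    for row in pixels:
        # min() keeps the index in range when 256 is not a multiple of len(chars)
        lines.append("".join(chars[min(value // step, last)] for value in row))
    return "\n".join(lines)


art = to_ascii([[0, 255], [128, 16]])
print(art)
```

With the 16-character ramp each bucket covers 16 brightness levels, so 0 maps to M and 255 maps to a space.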


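GUI

Libraries for writing GUIs in Python: PyQt, wxPython, tkinter, Kivy

tkinter is one of the more commonly used; it is part of Python's standard library.

Check that it is available

Open cmd and enter: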

python -m tkinter

Getting started

tkinter wraps each kind of widget in a class.

Each widget has attributes that can be set, such as width, height, font, and color.

import tkinter as tk

app = tk.Tk()
app.title("My App")  # set the window title
app.geometry("600x400")  # geometry sets the window's width and height
lb = tk.Label(app, text="Hello World")  # a Label widget displays text
lb.pack()  # pack places the widget in the window
app.mainloop()  # mainloop keeps the window open and waits for user interaction

Example:

import tkinter as tk

app = tk.Tk()
app.title("Jarvis")
app.geometry("600x400")
lb = tk.Label(app, text="Hello World", width=20, height=10, fg="blue")
lb.pack()  # pack() returns None, so call it separately to keep the Label in lb
tk.Button(app, text="Click me", width=20, background='pink').pack()
app.mainloop()

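A button can have a click handler: the command option binds a callback function, which is called when the button is clicked and can, for example, change the text of a Label.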
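A minimal sketch of such a callback, assuming a click counter as the thing to display: the names click_message, build_app and on_click are made up for illustration. The text logic is kept in a plain function so it can be checked without opening a window.

```python
def click_message(count):
    """Return the text the label should show after count clicks."""
    return f"Clicked {count} time" + ("" if count == 1 else "s")


def build_app():
    """Create the window; call .mainloop() on the result to show it."""
    import tkinter as tk  # imported here so click_message works without a display

    app = tk.Tk()
    app.title("My App")
    app.geometry("600x400")

    clicks = {"count": 0}  # mutable holder so the callback can update it
    label = tk.Label(app, text="Hello World")
    label.pack()

    def on_click():
        clicks["count"] += 1
        label.config(text=click_message(clicks["count"]))  # update the Label

    # command binds the callback; pass the function itself, without ()
    tk.Button(app, text="Click me", command=on_click).pack()
    return app


# To open the window: build_app().mainloop()
```

Writing command=on_click() instead would call the function once while the button is being created and bind its return value, None, so the button would do nothing when clicked.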