Commit ba83d019, authored by 懂一点的陈老师

update all chapter

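Parent afa9394d
# Python Basics and Introduction to Data Analysis
## Python Beginner Tutorial
* Chapter 1: Basic Concepts
* Chapter 2: Common Data Types
* Chapter 3: String Processing and Formatted Output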
......@@ -13,17 +13,9 @@
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"1 100 -80 0\n"
"outputs": [],
"source": [
"a1 = 1\n",
"a2 = 100\n",
......@@ -438,7 +430,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.1"
"nbformat": 4,
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### SQLite数据库\n",
"windows 安装 [下载](https://www.sqlite.org/download.html)\n",
"## 使用SQLite\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 导入SQLite驱动:\n",
"import sqlite3\n",
"# 连接到SQLite数据库\n",
"# 数据库文件是test.db\n",
"# 如果文件不存在,会自动在当前目录创建:\n",
"# 删掉已经存在的数据库\n",
"db_file = 'test.db'\n",
"if os.path.isfile(db_file):\n",
" os.remove(db_file)\n",
"conn = sqlite3.connect('test.db')"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 创建一个Cursor:\n",
"cursor = conn.cursor()"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 执行一条SQL语句,创建user表:\n",
"cursor.execute('create table user (id varchar(20) primary key, name varchar(20))')"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 继续执行一条SQL语句,插入一条记录:\n",
"cursor.execute(\"insert into user (id, name) values ('5', 'Mike')\")"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 通过rowcount获得插入的行数:\n",
"# 关闭Cursor:\n",
"# 提交事务:\n",
"# 关闭Connection:\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 查询记录\n",
"conn = sqlite3.connect('test.db')\n",
"cursor = conn.cursor()\n",
"# 执行查询语句:\n",
"cursor.execute('select * from user' )\n",
"# 获得查询结果集:\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 练习\n",
"import os, sqlite3\n",
"db_file = 'exec.db'\n",
"if os.path.isfile(db_file):\n",
" os.remove(db_file)\n",
"# 创建表格\n",
"conn = sqlite3.connect(db_file)\n",
"cursor = conn.cursor()\n",
"cursor.execute('create table student (id varchar(20) primary key, name varchar(20), score int)')\n",
"# 插入数据\n",
"cursor.execute(\"insert into student values ('A-001', 'Adam', 95)\")\n",
"# 再插入两条数据\n",
"def get_all():\n",
" # 连接到SQLite数据库\n",
" # 执行一条SQL语句\n",
" # 获得查询结果集\n",
" ???\n",
" print(all_value)"
"cell_type": "markdown",
"metadata": {},
"source": [
"## SQLAlchemy\n",
"SQLAlchemy是python的一个数据库ORM工具,提供了强大的对象模型间的转换,可以满足绝大多数数据库操作的需求,并且支持多种数据库引擎(sqlite,mysql,postgres, mongodb等)\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### connection\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy import create_engine"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 数据库连接字符串\n",
"DB_CONNECT_STRING = 'sqlite:///test.db'\n",
"# 创建数据库引擎,echo为True,会打印所有的sql语句\n",
"engine = create_engine(DB_CONNECT_STRING, echo=True)\n",
"# 创建一个connection,这里的使用方式与python自带的sqlite的使用方式类似\n",
"with engine.connect() as con:\n",
" # 执行sql语句,如果是增删改,则直接生效,不需要commit\n",
" rs = con.execute('select * from user')\n",
" data = rs.fetchone()\n",
" print(\"Data: %s\" % data)"
"cell_type": "markdown",
"metadata": {},
"source": [
"### connection事务\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DB_CONNECT_STRING = 'sqlite:///test.db'\n",
"engine = create_engine(DB_CONNECT_STRING)\n",
"with engine.connect() as connection:\n",
"    trans = connection.begin()\n",
"    try:\n",
"        r0 = connection.execute(\"create table book (id varchar(20) primary key, name varchar(20), user_id varchar(20))\")\n",
"        #r0 = connection.execute(\"create table user (id varchar(20) primary key, name varchar(20))\")\n",
"    except Exception:\n",
"        print(\"The table already exists; skipping creation and continuing...\")\n",
"    try:\n",
"        # Insert and read back inside the transaction; commit only if both succeed\n",
"        r1 = connection.execute(\"insert into book (id,name, user_id) values ('3', 'Lucxx', '2')\")\n",
"        r2 = connection.execute(\"select * from book\")\n",
"        trans.commit()\n",
"        print(r2.fetchall())\n",
"    except:\n",
"        # Undo the partial work, then re-raise so the error stays visible\n",
"        trans.rollback()\n",
"        raise"
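The commit-or-rollback pattern above is not specific to SQLAlchemy. As a minimal sketch using only the standard-library `sqlite3` module (a swapped-in technique, not the notebook's SQLAlchemy code), a connection used as a context manager commits when the block succeeds and rolls back when it raises. The helper name `add_books` and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table book (id varchar(20) primary key, name varchar(20))")


def add_books(conn, books):
    """Insert all books in one transaction (hypothetical helper)."""
    # "with conn" commits if the block succeeds and rolls back if it raises.
    with conn:
        for book_id, name in books:
            conn.execute("insert into book (id, name) values (?, ?)", (book_id, name))


add_books(conn, [("1", "Learn Python")])
try:
    # The second row reuses primary key "1", so the whole batch is rolled back.
    add_books(conn, [("2", "Learn SQL"), ("1", "Duplicate")])
except sqlite3.IntegrityError:
    pass

book_names = [row[0] for row in conn.execute("select name from book order by id")]
print(book_names)
```

Because the failing batch is rolled back as a unit, "Learn SQL" is not stored even though its own insert succeeded.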
"cell_type": "markdown",
"metadata": {},
"source": [
"## session\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy.orm import sessionmaker"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 常用模式\n",
"# 数据库连接字符串\n",
"DB_CONNECT_STRING = 'sqlite:///test.db'\n",
"# 创建数据库引擎,echo为True,会打印所有的sql语句\n",
"engine = create_engine(DB_CONNECT_STRING, echo=True)\n",
"# 创建会话类\n",
"DB_Session = sessionmaker(bind=engine)\n",
"# 创建会话对象\n",
"session = DB_Session()\n",
"# 在回话中处理数据库操作\n",
"# 如果再次运行,不要运行创建表\n",
"#session.execute(\"create table member (id varchar(20) primary key, name varchar(20))\")\n",
"session.execute(\"insert into member(id, name) values('3', '小样')\")\n",
"session.commit() #来确认修改和增加的内容\n",
"# 用完记得关闭,也可以用with\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 练习 链接数据库,创建一个学生信息表(Student), 字段: id, name, age, 插入一条数据: (1, Tom, 19)\n",
"from sqlalchemy import create_engine\n",
"# 数据库链接设置\n",
"engine = create_engine(DB_CONNECT_STRING, echo=True)\n",
"with engine.connect() as connection:\n",
" trans = ???\n",
" ???:\n",
" r0 = connection.???(\"create table Student (id varchar(20) primary key, name varchar(20), age int)\")\n",
" ???:\n",
" print(\"已经有这个数据库,不用创建,继续...\")\n",
" try:\n",
" \n",
" r1 = connection.???(\"insert into Student (id,name,age) values ('1', 'Tom', 19)\")\n",
" trans.???\n",
" except:\n",
" trans.rollback()\n",
" raise"
"cell_type": "markdown",
"metadata": {},
"source": [
"### ORM\n",
"Object-Relational Mapping,把关系数据库的表结构映射到对象上\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class Member(object):\n",
" def __init__(self, id, name):\n",
" self.id = id\n",
" self.name = name"
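To make the idea of mapping rows to objects concrete, here is a hand-written sketch that does what an ORM automates: it reads rows with the standard-library `sqlite3` module and wraps each one in a `Member`. The function `load_members`, the in-memory database, and the sample row are assumptions for illustration; SQLAlchemy generates this kind of glue for you.

```python
import sqlite3


class Member(object):
    def __init__(self, id, name):
        self.id = id
        self.name = name


def load_members(conn):
    """Turn each row of the member table into a Member object (hypothetical helper)."""
    rows = conn.execute("select id, name from member order by id")
    return [Member(id=row[0], name=row[1]) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("create table member (id varchar(20) primary key, name varchar(20))")
conn.execute("insert into member (id, name) values ('3', 'Tom')")

members = load_members(conn)
print(members[0].id, members[0].name)
```

With objects in hand, the rest of the program can use attribute access (`member.name`) instead of tuple indexes.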
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 第一步,导入SQLAlchemy,并初始化DBSession:"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 导入:\n",
"from sqlalchemy import Column, String, create_engine\n",
"from sqlalchemy.orm import sessionmaker\n",
"from sqlalchemy.ext.declarative import declarative_base"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 创建对象的基类:\n",
"Base = declarative_base()\n",
"# 数据库连接字符串\n",
"DB_CONNECT_STRING = 'sqlite:///db.sqlite'\n",
"# 定义User对象:\n",
"class User(Base):\n",
" # 表的名字:\n",
" __tablename__ = 'user'\n",
" # 表的结构:\n",
" id = Column(String(20), primary_key=True)\n",
" name = Column(String(20))\n",
"# 初始化数据库连接:\n",
"engine = create_engine(DB_CONNECT_STRING)\n",
"# 创建DBSession类型:\n",
"DBSession = sessionmaker(bind=engine)"
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 创建session对象:\n",
"session = DBSession()\n",
"# 创建新User对象:\n",
"new_user = User(id='2', name='Bob')\n",
"# 添加到session:\n",
"# 提交即保存到数据库:\n",
"# 关闭session:\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 从数据库表中查询数据\n",
"# 创建Session:\n",
"session = DBSession()\n",
"# 创建Query查询,filter是where条件,最后调用one()返回唯一行,如果调用all()则返回所有行:\n",
"user = session.query(User).filter(User.id=='1').one()\n",
"# 打印类型和对象的name属性:\n",
"print('type:', type(user))\n",
"print('name:', user.name)\n",
"# 关闭Session:\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy import ForeignKey\n",
"# 创建一个书的类\n",
"class Book(Base):\n",
" __tablename__ = 'book'\n",
" id = Column(String(20), primary_key=True)\n",
" name = Column(String(20))\n",
" # “多”的一方的book表是通过外键关联到user表的:\n",
" user_id = Column(String(20), ForeignKey('user.id'))"
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 创建session对象:\n",
"session = DBSession()\n",
"# 创建新User对象:\n",
"new_user = User(id='21', name='Kerry')\n",
"# 添加到session:\n",
"new_book = Book(id='10', name='Learn Python', user_id = new_user.id)\n",
"print('书本名字:%s, 用户:%s' % (new_book.name, new_user.name))\n",
"# 提交即保存到数据库:\n",
"# 关闭session:\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 练习1 从数据库表中查询book数据\n",
"# 创建Session:\n",
"session = DBSession()\n",
"# 创建Query查询,filter是where条件,最后调用one()返回唯一行,如果调用all()则返回所有行:\n",
"book = session.query(???).filter(???.id=='3').one()\n",
"# 打印结果\n",
"print('book id', ???)\n",
"print('book name:', ???)\n",
"# 关闭Session:\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习\n",
"### 总结\n",
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"nbformat": 4,
"nbformat_minor": 2
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"## Web应用介绍\n",
"![web application](http://intelhunt.com/images/Web-Application-Development1.jpg)\n",
"### HTML\n",
" \n",
"### CSS\n",
"CSS是Cascading Style Sheets(层叠样式表)的简称,CSS用来控制HTML里的所有元素如何展现\n",
"### JavaScript简介\n",
"### Web应用例子\n",
"### 静态网站和动态网站\n",
"1. http://www.wuzhen.com.cn/\n",
"2. https://similar.ai/\n",
"3. https://www.taobao.com/\n",
"4. https://www.toutiao.com/"
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# server.py\n",
"# 从wsgiref模块导入:\n",
"from wsgiref.simple_server import make_server\n",
"# 导入我们自己编写的application函数:\n",
"# from hello import application\n",
"def application(environ, start_response):\n",
" start_response('200 OK', [('Content-Type', 'text/html')])\n",
" return [b'<h1>Hello, web!</h1>']\n",
"# 创建一个服务器,IP地址为空,端口是8000,处理函数是application:\n",
"httpd = make_server('', 8000, application)\n",
"print('Serving HTTP on port 8000...')\n",
"# 开始监听HTTP请求:\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"## 练习\n",
"### 参考上面的例子,开启一个服务器"
"cell_type": "markdown",
"metadata": {},
"source": [
"## Flask框架\n",
"WSGI提供的接口虽然比HTTP接口高级了不少,但和Web App的处理逻辑比,还是比较低级,我们需要在WSGI接口之上能进一步抽象,让我们专注于用一个函数处理一个URL,至于URL到函数的映射,就交给Web框架来做\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 再来一个Hello\n",
"from flask import Flask\n",
"app = Flask(__name__)\n",
"def hello_world():\n",
" return 'Hello, World!'"
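To show what "mapping a URL to a function" means, here is a toy sketch of a route registry built directly on the WSGI interface. `TinyApp` is a hypothetical class written for this illustration; it is not how Flask is implemented and it ignores HTTP methods, URL parameters, and error handling.

```python
class TinyApp:
    """Hypothetical mini-framework: maps URL paths to view functions."""

    def __init__(self):
        self.routes = {}

    def route(self, path):
        # Decorator factory: remember which function handles which path.
        def decorator(func):
            self.routes[path] = func
            return func
        return decorator

    def __call__(self, environ, start_response):
        # The WSGI entry point: look up the view for the requested path.
        view = self.routes.get(environ.get("PATH_INFO", "/"))
        if view is None:
            start_response("404 Not Found", [("Content-Type", "text/plain")])
            return [b"Not Found"]
        start_response("200 OK", [("Content-Type", "text/html")])
        return [view().encode("utf-8")]


app = TinyApp()


@app.route("/")
def hello_world():
    return "Hello, World!"


statuses = []


def record(status, headers):
    statuses.append(status)


home = app({"PATH_INFO": "/"}, record)
missing = app({"PATH_INFO": "/nope"}, record)
print(statuses, home, missing)
```

The view function only returns a string; the framework layer takes care of status lines, headers, and encoding.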
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### 练习:\n",
"先看 flask_post.py 和 flask_example.py\n",
"再看 flask_api.py,完成里面练习,需要用到之前的db_test.py 和 db.sqlite 数据库"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"nbformat": 4,
"nbformat_minor": 2
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"## Matplotlib [官网](https://matplotlib.org/users/pyplot_tutorial.html)\n",
"3.后端是设备相关的绘图设备,也称为渲染器,将前端表示转换为打印件或显示设备;后端示例:PS 创建 PostScript® 打印件,SVG 创建可缩放矢量图形打印件,Agg 使用 Matplotlib 附带的高质量反颗粒几何库创建 PNG 输出,GTK 在 Gtk+ 应用程序中嵌入 Matplotlib,GTKAgg 使用反颗粒渲染器创建图形并将其嵌入到 Gtk+ 应用程序中,以及用于 PDF,WxWidgets,Tkinter 等\n"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib\n",
"import matplotlib.mlab as mlab\n",
"import matplotlib.pyplot as plt"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 例子1\n",
"def simple_plot():\n",
" # 生成测试数据\n",
" x = np.linspace(-np.pi, np.pi, 256, endpoint=True)\n",
" y_cos, y_sin = np.cos(x), np.sin(x)\n",
" # 生成画布,并设定标题\n",
" # 画布大小,dpi=清晰度\n",
" plt.figure(figsize=(8, 6), dpi=80)\n",
" plt.title(\"Simple plot\")\n",
" plt.grid(True) # 带网格\n",
" # 设置X轴\n",
" plt.xlabel(\"X\")\n",
" plt.xlim(-4.0, 4.0)\n",
" plt.xticks(np.linspace(-4, 4, 9, endpoint=True))\n",
" # 设置Y轴\n",
" plt.ylabel(\"Y\")\n",
" plt.ylim(-1.0, 1.0)\n",
" plt.yticks(np.linspace(-1, 1, 9, endpoint=True))\n",
" # 画两条曲线\n",
" plt.plot(x, y_cos, \"b--\", linewidth=2.0, label=\"cos\")\n",
" plt.plot(x, y_sin, \"g-\", linewidth=2.0, label=\"sin\")\n",
" # 设置图例位置,loc可以为[upper, lower, left, right, center]\n",
" plt.legend(loc=\"upper left\",shadow=True) \n",
" # 图形显示\n",
" plt.show()\n",
" return"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 更复杂一点\n",
"def simple_advanced_plot():\n",
" \"\"\"\n",
" simple advanced plot\n",
" \"\"\"\n",
" # 生成测试数据\n",
" x = np.linspace(-np.pi, np.pi, 256, endpoint=True)\n",
" y_cos, y_sin = np.cos(x), np.sin(x)\n",
" # 生成画布, 并设定标题\n",
" plt.figure(figsize=(8, 6), dpi=80)\n",
" plt.title(\"simple advanced plot\")\n",
" plt.grid(True)\n",
" # 画图的另外一种方式\n",
" ax_1 = plt.subplot(111) # 也可以写成plt.subplot(1,1,1)\n",
" ax_1.plot(x, y_cos, color=\"blue\", linewidth=2.0, linestyle=\"--\", label=\"left cos\")\n",
" ax_1.legend(loc=\"upper left\", shadow=True)\n",
" # 设置Y轴(左边)\n",
" ax_1.set_ylabel(\"left cos y\")\n",
" ax_1.set_ylim(-1.0, 1.0)\n",
" ax_1.set_yticks(np.linspace(-1, 1, 9, endpoint=True))\n",
" # 画图的另外一种方式\n",
" ax_2 = ax_1.twinx()\n",
" ax_2.plot(x, y_sin, color=\"green\", linewidth=2.0, linestyle=\"-\", label=\"right sin\")\n",
" ax_2.legend(loc=\"upper right\", shadow=True)\n",
" # 设置Y轴(右边)\n",
" ax_2.set_ylabel(\"right sin y\")\n",
" ax_2.set_ylim(-2.0, 2.0)\n",
" ax_2.set_yticks(np.linspace(-2, 2, 9, endpoint=True))\n",
" # 设置X轴(共同)\n",
" ax_1.set_xlabel(\"x\")\n",
" ax_1.set_xlim(-4.0, 4.0)\n",
" ax_1.set_xticks(np.linspace(-4, 4, 9, endpoint=True))\n",
" # 图形显示\n",
" plt.show()\n",
" return"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 练习把上面的图,改一下线段颜色和形式, 如:red, yellow, ; -."
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 一次画多个图\n",
"def subplot_plot():\n",
" \"\"\"\n",
" subplot plot\n",
" \"\"\"\n",
" # 子图的style列表\n",
" style_list = [\"g+-\", \"r*-\", \"b.-\", \"yo-\"]\n",
" \n",
" plt.figure(figsize=(8, 6), dpi=80)\n",
" # 依次画图\n",
" for num in range(4):\n",
" # 生成测试数据\n",
" x = np.linspace(0.0, 2+num, num=10*(num+1))\n",
" y = np.sin((5-num) * np.pi * x)\n",
" # 子图的生成方式\n",
" plt.subplot(2, 2, num+1)\n",
" plt.title(\"sub plot %d\" % (num+1))\n",
" plt.plot(x, y, style_list[num])\n",
" # 图形显示\n",
" plt.show()\n",
" return"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 柱状图\n",
"def bar_plot():\n",
" \"\"\"\n",
" bar plot\n",
" \"\"\"\n",
" # 生成测试数据\n",
" means_men = (20, 35, 30, 35, 27)\n",
" means_women = (25, 32, 34, 20, 25)\n",
" # 设置标题\n",
" plt.title(\"bar plot\")\n",
" # 设置相关参数\n",
" index = np.arange(len(means_men))\n",
" bar_width = 0.35\n",
" # 画柱状图\n",
" plt.bar(index, means_men, width=bar_width, alpha=0.2, color=\"b\", label=\"boy\")\n",
" plt.bar(index+bar_width, means_women, width=bar_width, alpha=0.8, color=\"r\", label=\"lady\")\n",
" plt.legend(loc=\"upper right\",shadow=True)\n",
" # 设置柱状图标示\n",
" for x, y in zip(index, means_men):\n",
" plt.text(x, y+0.3, y, ha=\"center\", va=\"bottom\")\n",
" for x, y in zip(index, means_women):\n",
" plt.text(x+bar_width, y+0.3, y, ha=\"center\", va=\"bottom\")\n",
" # 设置刻度范围/坐标轴名称等\n",
" plt.ylim(0, 45)\n",
" plt.xlabel(\"Group\")\n",
" plt.ylabel(\"Scores\")\n",
" plt.xticks(index+(bar_width/2), (\"A\", \"B\", \"C\", \"D\", \"E\"))\n",
" # 图形显示\n",
" plt.show()\n",
" return"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 横向柱状图\n",
"def barh_plot():\n",
" \"\"\"\n",
" barh plot\n",
" \"\"\"\n",
" # 生成测试数据\n",
" means_men = (20, 35, 30, 35, 27)\n",
" means_women = (25, 32, 34, 20, 25)\n",
" # 设置标题\n",
" plt.title(\"barh plot\")\n",
" # 设置相关参数\n",
" index = np.arange(len(means_men))\n",
" bar_height = 0.35\n",
" # 画柱状图(水平方向)\n",
" plt.barh(index, means_men, height=bar_height, alpha=0.2, color=\"b\", label=\"Men\")\n",
" plt.barh(index+bar_height, means_women, height=bar_height, alpha=0.8, color=\"r\", label=\"Women\")\n",
" plt.legend(loc=\"upper right\", shadow=True)\n",
" # 设置柱状图标示\n",
" for x, y in zip(index, means_men):\n",
" plt.text(y+0.3, x, y, ha=\"left\", va=\"center\")\n",
" for x, y in zip(index, means_women):\n",
" plt.text(y+0.3, x+bar_height, y, ha=\"left\", va=\"center\")\n",
" # 设置刻度范围/坐标轴名称等\n",
" plt.xlim(0, 45)\n",
" plt.xlabel(\"Scores\")\n",
" plt.ylabel(\"Group\")\n",
" plt.yticks(index+(bar_height/2), (\"A\", \"B\", \"C\", \"D\", \"E\"))\n",
" # 图形显示\n",
" plt.show()\n",
" return"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 层次柱状图\n",
"def table_plot():\n",
" \"\"\"\n",
" table plot\n",
" \"\"\"\n",
" # 生成测试数据\n",
" data = np.array([\n",
" [1, 4, 2, 5, 2],\n",
" [2, 1, 1, 3, 6],\n",
" [5, 3, 6, 4, 1]\n",
" ])\n",
" # 设置标题\n",
" plt.title(\"table plot\")\n",
" # 设置相关参数\n",
" index = np.arange(len(data[0]))\n",
" color_index = [\"r\", \"g\", \"b\"]\n",
" # 声明底部位置\n",
" bottom = np.array([0, 0, 0, 0, 0])\n",
" # 依次画图,并更新底部位置\n",
" for i in range(len(data)):\n",
" plt.bar(index, data[i], width=0.5, color=color_index[i], bottom=bottom, alpha=0.7, label=\"label %d\" % i)\n",
" bottom += data[i]\n",
" # 设置图例位置\n",
" plt.legend(loc=\"upper left\", shadow=True)\n",
" # 图形显示\n",
" plt.show()\n",
" return"
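The stacking in `table_plot` depends on the running `bottom` array: each layer starts where the previous layers ended. The plain-Python sketch below reproduces only that bookkeeping for the same test data, without drawing anything; the function name `stacked_bottoms` is hypothetical.

```python
data = [
    [1, 4, 2, 5, 2],
    [2, 1, 1, 3, 6],
    [5, 3, 6, 4, 1],
]


def stacked_bottoms(rows):
    """Return the bottom offsets used for each layer, plus the final column totals."""
    bottoms = []
    running = [0] * len(rows[0])
    for row in rows:
        bottoms.append(list(running))  # where this layer starts
        running = [b + v for b, v in zip(running, row)]  # where the next layer starts
    return bottoms, running


bottoms, totals = stacked_bottoms(data)
for layer, offsets in enumerate(bottoms):
    print("layer %d starts at %s" % (layer, offsets))
print("column totals:", totals)
```

The first layer always starts at zero, and the column totals give the overall height of each stacked bar.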
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 饼图\n",
"def pie_plot():\n",
" \"\"\"\n",
" pie plot\n",
" \"\"\"\n",
" # 生成测试数据\n",
" sizes = [15, 30, 45, 10]\n",
" labels = [\"Frogs\", \"Cat\", \"Dogs\", \"Logs\"]\n",
" colors = [\"yellowgreen\", \"gold\", \"lightskyblue\", \"lightcoral\"]\n",
" # 设置标题\n",
" plt.title(\"pie\")\n",
" # 设置突出参数\n",
" explode = [0, 0.05, 0, 0]\n",
" # 画饼状图\n",
" patches, l_text, p_text = plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct=\"%1.1f%%\", shadow=True, startangle=90)\n",
" plt.axis(\"equal\")\n",
" # 图形显示\n",
" plt.show()\n",
" return\n",
"# pie_plot()"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 散点图\n",
"def scatter_plot():\n",
" \"\"\"\n",
" scatter plot\n",
" \"\"\"\n",
" # 生成测试数据\n",
" point_count = 1000\n",
" x_index = np.random.random(point_count)\n",
" y_index = np.random.random(point_count)\n",
" # 设置标题\n",
" plt.title(\"scatter\")\n",
" # 设置相关参数\n",
" color_list = np.random.random(point_count)\n",
" scale_list = np.random.random(point_count) * 100\n",
" # 画散点图\n",
" plt.scatter(x_index, y_index, s=scale_list, c=color_list, marker=\"o\")\n",
" # 图形显示\n",
" plt.show()\n",
" return\n",
"# scatter_plot()"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 雷达图\n",
"def radar_plot():\n",
" \"\"\"\n",
" radar plot\n",
" \"\"\"\n",
" # 生成测试数据\n",
" labels = np.array([\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"])\n",
" data = np.array([68, 83, 90, 77, 89, 73])\n",
" theta = np.linspace(0, 2*np.pi, len(data), endpoint=False)\n",
" # 数据预处理\n",
" data = np.concatenate((data, [data[0]]))\n",
" theta = np.concatenate((theta, [theta[0]]))\n",
" # 画图方式\n",
" plt.subplot(111, polar=True)\n",
" plt.title(\"radar\")\n",
" # 设置\"theta grid\"/\"radar grid\"\n",
" plt.thetagrids(theta*(180/np.pi), labels=labels)\n",
" plt.rgrids(np.arange(20, 100, 20), labels=np.arange(20, 100, 20), angle=0)\n",
" plt.ylim(0, 100)\n",
" # 画雷达图,并填充雷达图内部区域\n",
" plt.plot(theta, data, \"bo-\", linewidth=2)\n",
" plt.fill(theta, data, color=\"red\", alpha=0.25)\n",
" \n",
" # 保存图片\n",
" plt.savefig('radar.png')\n",
" # 图形显示\n",
" plt.show()\n",
" return"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 练习\n",
"# 测试数据,5次考试的平均值, 第一行是1班考试成绩,第二行是2班考试成绩, 自定义一个主题,画一个柱状图, 画一个饼图, 如1班和2班5次成绩比较, \n",
"data = np.array([\n",
" [80, 84, 92, 100, 62],\n",
" [60, 100, 100, 93, 86],\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pandas 中的使用\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"import pandas as pd"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.linspace(0,2*np.pi,100)  # 100 evenly spaced points from 0 to 2π\n",
"df = pd.DataFrame(data={'sin':np.sin(x),'cos':np.cos(x)},index=x)  # create the DataFrame\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.plot(title='title', fontsize=20, figsize=(8, 6), grid=True)\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = pd.Series(data=np.random.randint(0,10,size=5),index=list('abcde')) \n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(np.random.randint(0,150,size=(20,3)),columns=['python','math','eng'])\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['php'] = df['python'].map(lambda x: x*0.9+np.random.randint(-10,10,1)[0])\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 练习 录入数据, 用id 和 price 两个参数画一个柱状图 , 散点图\n",
"wz_df = pd.read_csv('datas/waizi_v2.csv')\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"wz_df = wz_df[wz_df['type']=='合同外资金额']\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": []
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
"nbformat": 4,
"nbformat_minor": 2
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"## Seaborn\n",
"1. 在只需要简单地作图时直接用Pandas,但要想做出更加吸引人,更丰富的图就可以使用Seaborn\n",
"2. Pandas的作图函数并没有太多的参数来调整图形,所以你必须要深入了解matplotlib\n",
"3. Seaborn的作图函数中提供了大量的参数来调整图形,所以并不需要太深入了解matplotlib\n",
"4. Seaborn的API:https://stanford.edu/~mwaskom/software/seaborn/api.html#style-frontend"
"cell_type": "markdown",
"metadata": {},
"source": [
"## 泰坦尼克号数据分析\n",
"## 掌握数据概况"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning)\n",
"# Silence all remaining warnings by replacing warnings.warn with a no-op\n",
"def warn(*args, **kwargs):\n",
"    pass\n",
"warnings.warn = warn"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"df_train = pd.read_csv(\"titanic/train.csv\")\n",
"df_test = pd.read_csv(\"titanic/test.csv\")  # left as an exercise for you to analyze"
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"- PassengerId => 乘客ID\n",
"- Survived => 是否获救\n",
"- Pclass => 乘客等级(1/2/3等舱位)\n",
"- Name => 乘客姓名\n",
"- Sex => 性别\n",
"- Age => 年龄\n",
"- SibSp => 堂兄弟/妹个数\n",
"- Parch => 父母与小孩个数\n",
"- Ticket => 船票信息\n",
"- Fare => 票价\n",
"- Cabin => 客舱\n",
"- Embarked => 登船港口"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(figsize=(9,5))\n",
"sns.heatmap(df_train.isnull(), cbar=False, cmap=\"YlGnBu_r\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 这些是类别列\n",
"cols = ['Survived', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nr_rows = 2\n",
"nr_cols = 3\n",
"fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*3.5,nr_rows*3))\n",
"for r in range(0,nr_rows):\n",
"    for c in range(0,nr_cols):\n",
"        # Map the grid position (r, c) to an index into cols\n",
"        i = r*nr_cols+c\n",
"        ax = axs[r][c]\n",
"        sns.countplot(x=cols[i], hue=\"Survived\", data=df_train, ax=ax)\n",
"        ax.set_title(cols[i])\n",
"        ax.legend()\n",
"\n",
"plt.tight_layout()"
"cell_type": "markdown",
"metadata": {},
"source": [
"### 认识数据\n",
"- 第一张图:?\n",
"- 第二张图:?\n",
"- 第三张图:?\n",
"- 第四,五张图:?\n",
"- 第六张图: ?"
"cell_type": "markdown",
"metadata": {},
"source": [
"### 看看年龄的因素 "
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bins = np.arange(0, 80, 5)\n",
"g = sns.FacetGrid(df_train, row='Sex', col='Pclass', hue='Survived', margin_titles=True, size=3, aspect=1.1)\n",
"g.map(sns.distplot, 'Age', kde=False, bins=bins, hist_kws=dict(alpha=0.6))\n",
"g.add_legend() \n",
"plt.show() "
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 分析一下"
"cell_type": "markdown",
"metadata": {},
"source": [
"### 看看你票价因素 "
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bins = np.arange(0, 550, 20)\n",
"g = sns.FacetGrid(df_train, row='Sex', col='Pclass', hue='Survived', margin_titles=True, size=3, aspect=1.1)\n",
"g.map(sns.distplot, 'Fare', kde=False, bins=bins, hist_kws=dict(alpha=0.6))\n",
"g.add_legend() \n",
"plt.show() "
"cell_type": "markdown",
"metadata": {},
"source": [
"### 仓位因素"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x='Pclass', y='Survived', data=df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of Pclass\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x='Sex', y='Survived', hue='Pclass', data=df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of Pclass and Sex\")\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### 登船口因素"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x='Embarked', y='Survived', data=df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of Embarked Port\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(x='Embarked', y='Fare', data=df_train)\n",
"plt.title(\"Fare distribution as function of Embarked Port\")\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"## 增加一些新维度\n",
"### 家庭大小,单独,名字长度,称呼"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"for df in [df_train, df_test]:\n",
"    # Family size: siblings/spouses plus parents/children aboard\n",
"    df['FamilySize'] = df['SibSp'] + df['Parch']\n",
"    # Alone: 1 if the passenger has no family aboard, else 0\n",
"    df['Alone'] = 0\n",
"    df.loc[(df.FamilySize==0),'Alone'] = 1\n",
"    # Name length, binned in steps of 5 characters\n",
"    df['NameLen'] = df.Name.apply(lambda x : len(x))\n",
"    df['NameLenBin'] = np.nan\n",
"    for i in range(20,0,-1):\n",
"        df.loc[ df['NameLen'] <= i*5, 'NameLenBin'] = i\n",
"    # Extract the salutation (the word before the first period), then merge rare titles\n",
"    df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\\.', expand=False)\n",
"    df['Title'] = df['Title'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],\n",
"                                      ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'])"
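The title extraction above relies on a regular expression that captures the first run of letters followed by a period. The sketch below applies the same pattern and the same replacement table with the standard-library `re` module, so the logic can be checked on a few strings without pandas. The helper `extract_title` and the sample names are illustrative assumptions written in the dataset's "Surname, Title. Given names" format.

```python
import re

# Same pattern as in the notebook: letters immediately followed by a period.
TITLE_RE = re.compile(r'([A-Za-z]+)\.')

# Same replacement table as in the notebook: rare titles are merged into common ones.
TITLE_MAP = {
    'Mlle': 'Miss', 'Mme': 'Miss', 'Ms': 'Miss', 'Dr': 'Mr', 'Major': 'Mr',
    'Lady': 'Mrs', 'Countess': 'Mrs', 'Jonkheer': 'Other', 'Col': 'Other',
    'Rev': 'Other', 'Capt': 'Mr', 'Sir': 'Mr', 'Don': 'Mr',
}


def extract_title(name):
    """Return the normalized title found in name, or None if there is none."""
    match = TITLE_RE.search(name)
    if match is None:
        return None
    title = match.group(1)
    return TITLE_MAP.get(title, title)


names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Sagesse, Mlle. Emma",
    "No title here",
]
titles = [extract_title(name) for name in names]
print(titles)
```

Names without a period-terminated word yield `None`, which corresponds to the missing values that the cleaning step later fills.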
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x='NameLenBin' , y='Survived' , data = df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of NameLenBin\")\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### 结论??"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.factorplot(x=\"NameLenBin\", y=\"Survived\", col=\"Sex\", data=df_train, kind=\"bar\", size=5, aspect=1.2)"
"cell_type": "markdown",
"metadata": {},
"source": [
"### 结论??\n",
"### 称呼因素"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x='Title' , y='Survived' , data = df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of Title\")\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.crosstab(df_train.FamilySize,df_train.Survived).apply(lambda r: r/r.sum(), axis=1).style.background_gradient(cmap='summer_r')"
"cell_type": "markdown",
"metadata": {},
"source": [
"### 结论??"
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数据清洗\n",
"### 第一步填充缺失数据"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 根据称呼补充他们的性别\n",
"df_train['Title'] = df_train['Title'].fillna(df_train['Title'].mode().iloc[0])\n",
"# 年龄使用平均值填充\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Mr'),'Age']= df_train.Age[df_train.Title==\"Mr\"].mean()\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Mrs'),'Age']= df_train.Age[df_train.Title==\"Mrs\"].mean()\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Master'),'Age']= df_train.Age[df_train.Title==\"Master\"].mean()\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Miss'),'Age']= df_train.Age[df_train.Title==\"Miss\"].mean()\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Other'),'Age']= df_train.Age[df_train.Title==\"Other\"].mean()\n",
"df_train = df_train.drop('Name', axis=1)"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 设置登船口默认值是第一个\n",
"df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode().iloc[0])\n",
"# 票价用平均值填充\n",
"df_train['Fare'] = df_train['Fare'].fillna(df_train['Fare'].mean())"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 年龄按10年分段,票价按50分段,方便查找规律\n",
"df = df_train\n",
"for i in range(8,0,-1):\n",
" df.loc[ df['Age'] <= i*10, 'Age_bin'] = i\n",
"for i in range(12,0,-1):\n",
" df.loc[ df['Fare'] <= i*50, 'Fare_bin'] = i \n",
"# 把文字变成数字,让计算机更好处理\n",
"df['Title'] = df['Title'].map( {'Other':0, 'Mr': 1, 'Master':2, 'Miss': 3, 'Mrs': 4 } )\n",
"# 如果称呼为空,填充第一个\n",
"df['Title'] = df['Title'].fillna(df['Title'].mode().iloc[0])\n",
"df['Title'] = df['Title'].astype(int) "
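The descending loops above assign each value to the smallest bin whose upper edge it does not exceed, because later (smaller) iterations overwrite earlier ones. The plain-Python sketch below mirrors that loop for a single value so the edge cases are easy to see; `bin_index` is a hypothetical helper, and the sample ages and fares are made up for illustration.

```python
import math


def bin_index(value, width, n_bins):
    """Mirror the notebook's descending loop for one value.

    Returns the smallest i in 1..n_bins with value <= i * width,
    or NaN if the value is above the last bin edge (or is NaN itself).
    """
    result = math.nan
    for i in range(n_bins, 0, -1):
        if value <= i * width:
            result = i  # smaller i overwrites larger i, as in the DataFrame version
    return result


age_bins = [bin_index(age, 10, 8) for age in (0.42, 10, 22, 80)]
fare_bins = [bin_index(fare, 50, 12) for fare in (7.25, 50, 512.33)]
too_old = bin_index(85, 10, 8)
print(age_bins, fare_bins, too_old)
```

Two consequences are worth noting: bin edges are inclusive on the upper side (age 10 falls in bin 1), and values above the last edge are left unbinned.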
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 复制一份数据,保护原始数据\n",
"df_train_ml = df_train.copy()"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 把类别参数做成新的列,用0-1表示对应项\n",
"df_train_ml = pd.get_dummies(df_train_ml, columns=['Sex', 'Embarked', 'Pclass'], drop_first=True)\n",
"df_train_ml.drop(['PassengerId','Ticket','Cabin','Age', 'Fare_bin'],axis=1,inplace=True)\n",
"df_train_ml.drop(['NameLen'], axis=1, inplace=True)\n",
"df_train_ml.drop(['SibSp'], axis=1, inplace=True)\n",
"df_train_ml.drop(['Parch'], axis=1, inplace=True)\n",
"df_train_ml.drop(['Alone'], axis=1, inplace=True)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"## 下一步就是机器学习,有兴趣同学可以看 sklearn\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": []
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
"nbformat": 4,
"nbformat_minor": 2
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"## 进程和线程\n",
"### 有两种解决方案:\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### 多进程\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"import os\n",
"print('进程 (%s) 启动...' % os.getpid())\n",
"# Only works on Unix/Linux/Mac:\n",
"pid = os.fork()\n",
"if pid == 0:\n",
" print('我是子进程(%s) ,我的父进程是(%s).' % (os.getpid(), os.getppid()))\n",
" print('我 (%s) 创造了一个子进程 (%s).' % (os.getpid(), pid))"
"cell_type": "markdown",
"metadata": {},
"source": [
"## 由于Windows没有fork调用,上面的代码在Windows上无法运行"
"cell_type": "markdown",
"metadata": {},
"source": [
"### multiprocessing\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"from multiprocessing import Process\n",
"import os\n",
"import time"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 子进程要执行的代码\n",
"def run_proc(name):\n",
" time.sleep(10)\n",
" print('运行子进程 %s (%s)...' % (name, os.getpid()))"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"p = Process(target=run_proc, args=('test',))\n",
"print('父进程 %s.' % os.getpid())\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### 进程间通信\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"from multiprocessing import Process, Queue\n",
"import os, time, random\n",
"# 写数据进程执行的代码:\n",
"def write(q):\n",
" print('Process to write: %s' % os.getpid())\n",
" for value in ['A', 'B', 'C']:\n",
" print('Put %s to queue...' % value)\n",
" q.put(value)\n",
" time.sleep(random.random())\n",
"# 读数据进程执行的代码:\n",
"def read(q):\n",
" print('Process to read: %s' % os.getpid())\n",
" while True:\n",
" value = q.get(True)\n",
" print('Get %s from queue.' % value)"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 父进程创建Queue,并传给各个子进程:\n",
"q = Queue()\n",
"pw = Process(target=write, args=(q,))\n",
"pr = Process(target=read, args=(q,))\n",
"# 启动子进程pw,写入:\n",
"# 启动子进程pr,读取:\n",
"# 等待pw结束:\n",
"# pr进程里是死循环,无法等待其结束,只能强行终止:\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### 多线程\n",
"由于线程是操作系统直接支持的执行单元,因此,高级语言通常都内置多线程的支持,Python也不例外,并且,Python的线程是真正的Posix Thread,而不是模拟出来的线程。\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"import time, threading, random"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 新线程执行的代码:\n",
"def loop():\n",
" print('线程 %s 运行中...' % threading.current_thread().name)\n",
" n = 0\n",
" while n < 5:\n",
" n = n + 1\n",
" print('线程 %s >>> %s' % (threading.current_thread().name, n))\n",
" time.sleep(1)\n",
" print('线程 %s 结束.' % threading.current_thread().name)\n",
"print('线程 %s 在运行...' % threading.current_thread().name)\n",
"t = threading.Thread(target=loop, name='LoopThread')\n",
"print('线程 %s 结束.' % threading.current_thread().name)"
"cell_type": "markdown",
"metadata": {},
"source": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### 应用例子\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"import threading\n",
"import requests\n",
"import time,random\n",
"def func1(url):\n",
" print('打开网页%s, 模拟爬虫工作' % url)\n",
" res = requests.get(url)\n",
" time.sleep(random.randint(2,30))\n",
" print('结束,%s 返回结果 %s' % (url, res.status_code))\n",
"def func2(urlinfo):\n",
" for i in urlinfo:\n",
" th = threading.Thread(target=func1,args=[i])\n",
" th.start()\n",
" print('主程序结束')\n",
"urlinfo = ['http://www.sohu.com', 'http://www.163.com', 'http://www.sina.com']\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 练习 输出 0到11,观察他们的输出顺序\n",
"import threading\n",
"num = 0\n",
"def t():\n",
" global ???\n",
" num += 1\n",
" print(num)\n",
"for i in range(0, 11):\n",
" d = threading.Thread(???)\n",
" d.???"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": []
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
"nbformat": 4,
"nbformat_minor": 2
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"## 正则表达式\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### re.match函数\n",
"```re.match(pattern, string, flags=0)```\n",
"flags\t标志位,用于控制正则表达式的匹配方式,如:是否区分大小写,多行匹配等等。参见:正则表达式修饰符 - 可选标志"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"import re"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"print(re.match('www', 'www.163.com').span())\n",
"print(re.match('163', 'www.163.com'))"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"line = \"Cats Are smarter than dogs\"\n",
"matchObj = re.match( r'(.*) are (.*?) .*', line, re.I)\n",
" \n",
"if matchObj:\n",
" print(\"matchObj.group() : \", matchObj.group())\n",
" print(\"matchObj.group(1) : \", matchObj.group(1))\n",
" print(\"matchObj.group(2) : \", matchObj.group(2))"
"cell_type": "markdown",
"metadata": {},
"source": [
"1. re.I\t 使匹配对大小写不敏感\n",
"2. re.L\t 做本地化识别(locale-aware)匹配\n",
"3. re.M\t 多行匹配,影响 ^ 和 $\n",
"4. re.S\t 使 . 匹配包括换行在内的所有字符\n",
"5. re.U\t 根据Unicode字符集解析字符。这个标志影响 \\w, \\W, \\b, \\B.\n",
"6. re.X\t 该标志通过给予你更灵活的格式以便你将正则表达式写得更易于理解。"
"cell_type": "markdown",
"metadata": {},
"source": [
"### re.search方法\n",
"re.search 扫描整个字符串并返回第一个成功的匹配。\n",
"```re.search(pattern, string, flags=0)```"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"print(re.search('www', 'www.163.com').span()) # 在起始位置匹配\n",
"print(re.search('163', 'www.163.com').span()) # 不在起始位置匹配"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"line = \"Cats Are smarter than dogs\"\n",
"matchObj = re.search( r'(.*) are (.*?) .*', line, re.I)\n",
" \n",
"if matchObj:\n",
" print(\"matchObj.group() : \", matchObj.group())\n",
" print(\"matchObj.group(1) : \", matchObj.group(1))\n",
" print(\"matchObj.group(2) : \", matchObj.group(2))"
"cell_type": "markdown",
"metadata": {},
"source": [
"#### re.match与re.search的区别\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 练习,找出“dogs”\n",
"line = \"Cats are smarter than dogs\";\n",
" \n",
"matchObj = re.???( r'dogs', line, re.I)\n",
"if matchObj:\n",
" print(\"match --> matchObj.group() : \", matchObj.group())\n",
" print(\"No match!!\")"
"cell_type": "markdown",
"metadata": {},
"source": [
"### 检索和替换\n",
"Python 的 re 模块提供了re.sub用于替换字符串中的匹配项。\n",
"```re.sub(pattern, repl, string, count=0, flags=0)```\n",
"- pattern : 正则中的模式字符串。\n",
"- repl : 替换的字符串,也可为一个函数。\n",
"- string : 要被查找替换的原始字符串。\n",
"- count : 模式匹配后替换的最大次数,默认 0 表示替换所有的匹配。"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"phone = \"0757-86547-1548 # 这是一个电话号码\"\n",
" \n",
"# 删除字符串中的 Python注释 \n",
"num = re.sub(r'#.*$', \"\", phone)\n",
"print(\"电话号码是: \", num)\n",
" \n",
"# 删除非数字(-)的字符串 \n",
"num = re.sub(r'\\D', \"\", phone)\n",
"#num = re.sub(r'-', \"\", num)\n",
"print(\"电话号码是 : \", num)"
"cell_type": "markdown",
"metadata": {},
"source": [
"- \\d\t匹配一个Unicode数字,如果带re.ASCII,则匹配0-9\n",
"- \\D 匹配Unicode非数字\n",
"- \\s\t匹配Unicode空白,如果带有re.ASCII,则匹配\\t\\n\\r\\f\\v中的一个\n",
"- \\S 匹配Unicode非空白\n",
"- \\w\t匹配Unicode单词字符,如果带有re.ascii,则匹配[a-zA-Z0-9_]中的一个\n",
"- \\W 匹配Unicode非单子字符"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 将匹配的数字乘以 2\n",
"def double(matched):\n",
" value = int(matched.group('value'))\n",
" return str(value * 2)\n",
" \n",
"s = 'A23G4HFD423'\n",
"print(re.sub('(?P<value>\\d+)', double, s))"
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
"name": "stdout",
"output_type": "stream",
"text": [
"{'province': '110', 'city': '223', 'born_year': '1990', 'born_month': '03', 'born_date': '06'}\n"
"source": [
"# 分组匹配\n",
"import re\n",
"s = '110223199003060030'\n",
"res = re.search('(?P<province>\\d{3})(?P<city>\\d{3})(?P<born_year>\\d{4})(?P<born_month>\\d{2})(?P<born_date>\\d{2})',s)\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 练习:不用正则表达方式实现同样功能"
"cell_type": "markdown",
"metadata": {},
"source": [
"### re.compile 函数\n",
"compile 函数用于编译正则表达式,生成一个正则表达式( Pattern )对象,供 match() 和 search() 这两个函数使用。\n",
"```re.compile(pattern[, flags])```\n",
"- pattern : 一个字符串形式的正则表达式\n",
"- flags : 可选,表示匹配模式,比如忽略大小写,多行模式等,具体参数为:\n",
"- re.I 忽略大小写\n",
"- re.L 表示特殊字符集 \\w, \\W, \\b, \\B, \\s, \\S 依赖于当前环境\n",
"- re.M 多行模式\n",
"- re.S 即为 . 并且包括换行符在内的任意字符(. 不包括换行符)\n",
"- re.U 表示特殊字符集 \\w, \\W, \\b, \\B, \\d, \\D, \\s, \\S 依赖于 Unicode 字符属性数据库\n",
"- re.X 为了增加可读性,忽略空格和 # 后面的注释"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"pattern = re.compile(r'\\d+')\n",
"m = pattern.match('one12twothree34four')\n",
"m = pattern.match('one12twothree34four', 3, 10)\n",
"print(m.start(0), m.end(0), m.span())"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)\n",
"m = pattern.match('Hello World Wide Web')\n",
"print(m.group(0)) # 返回匹配成功的整个子串\n",
"print(m.group(1)) # 返回第一个分组匹配成功的子串\n",
"print(m.group(2)) # 返回第二个分组匹配成功的子串\n",
"print(m.groups()) # 等价于 (m.group(1), m.group(2), ...)"
"cell_type": "markdown",
"metadata": {},
"source": [
"### findall\n",
"注意: match 和 search 是匹配一次 findall 匹配所有。\n",
"```findall(string[, pos[, endpos]])```\n",
"- string : 待匹配的字符串。\n",
"- pos : 可选参数,指定字符串的起始位置,默认为 0。\n",
"- endpos : 可选参数,指定字符串的结束位置,默认为字符串的长度。"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"pattern = re.compile(r'\\d+')\n",
"m = pattern.findall('one12twothree34four')\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### re.split\n",
"split 方法按照能够匹配的子串将字符串分割后返回列表,它的使用形式如下:\n",
"```re.split(pattern, string[, maxsplit=0, flags=0])```"
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"res = re.split('\\W+', 'abx, 123sd, good.')\n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": [
"# 练习不用正则表达,实现相同功能\n",
"s = 'sd1xxx2aa2a3sd3xx12yy'\n",
"res = re.split('\\d+', s)\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"## 正则表达式模式\n",
"由于正则表达式通常都包含反斜杠,所以你最好使用原始字符串来表示它们。模式元素(如 r'\\t',等价于 '\\\\t')匹配相应的特殊字符。\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"### 基本符号\n",
"- ^ 表示匹配字符串的开始位置 (例外 用在中括号中[ ] 时,可以理解为取反,表示不匹配括号中字符串)\n",
"- $ 表示匹配字符串的结束位置\n",
"- * 表示匹配 零次到多次\n",
"- + 表示匹配 一次到多次 (至少有一次)\n",
"- ? 表示匹配零次或一次\n",
"- . 表示匹配单个字符 \n",
"- | 表示为或者,两项中取一项\n",
"- ( ) 小括号表示匹配括号中全部字符\n",
"- [ ] 中括号表示匹配括号中一个字符 范围描述 如[0-9 a-z A-Z]\n",
"- { } 大括号用于限定匹配次数 如 {n}表示匹配n个字符 {n,}表示至少匹配n个字符 {n,m}表示至少n,最多m\n",
"- \\ 转义字符 如上基本符号匹配都需要转义字符 如 \\* 表示匹配*号\n",
"- \\w 表示英文字母和数字 \\W 非字母和数字\n",
"- \\d 表示数字 \\D 非数字\n",
"cell_type": "markdown",
"metadata": {},
"source": [
"匹配中文字符的正则表达式: [\\u4e00-\\u9fa5]\n",
"匹配空行的正则表达式:\\n[\\s| ]*\\r\n",
"匹配HTML标记的正则表达式:/<(.*)>.*<\\/\\1>|<(.*) \\/>/ \n",
"匹配IP地址的正则表达式:/(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)/g //\n",
"匹配网址URL的正则表达式:http://(/[\\w-]+\\.)+[\\w-]+(/[\\w- ./?%&=]*)?\n",
"1、非负整数:^\\d+$ \n",
"2、正整数:^[0-9]*[1-9][0-9]*$ \n",
"3、非正整数:^((-\\d+)|(0+))$ \n",
"4、负整数:^-[0-9]*[1-9][0-9]*$ \n",
"5、整数:^-?\\d+$ \n",
"6、非负浮点数:^\\d+(\\.\\d+)?$ \n",
"7、正浮点数:^((0-9)+\\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\\.[0-9]+)|([0-9]*[1-9][0-9]*))$ \n",
"8、非正浮点数:^((-\\d+\\.\\d+)?)|(0+(\\.0+)?))$ \n",
"9、负浮点数:^(-((正浮点数正则式)))$ \n",
"10、英文字符串:^[A-Za-z]+$ \n",
"11、英文大写串:^[A-Z]+$ \n",
"12、英文小写串:^[a-z]+$ \n",
"13、英文字符数字串:^[A-Za-z0-9]+$ \n",
"14、英数字加下划线串:^\\w+$ \n",
"15、E-mail地址:^[\\w-]+(\\.[\\w-]+)*@[\\w-]+(\\.[\\w-]+)+$ \n",
"16、URL:^[a-zA-Z]+://(\\w+(-\\w+)*)(\\.(\\w+(-\\w+)*))*(\\?\\s*)?$ \n",
"23、匹配HTML标记:<(.*)>.*<\\/\\1>|<(.*) \\/> \n",
"24、匹配空行:\\n[\\s| ]*\\r\n",
"25、提取信息中的网络链接:(h|H)(r|R)(e|E)(f|F) *= *('|\")?(\\w|\\\\|\\/|\\.)+('|\"| *|>)?\n",
"27、提取信息中的图片链接:(s|S)(r|R)(c|C) *= *('|\")?(\\w|\\\\|\\/|\\.)+('|\"| *|>)?\n",
"34、提取信息中的任何数字 :(-?\\d*)(\\.\\d+)? \n",
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
"outputs": [],
"source": []
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
"nbformat": 4,
"nbformat_minor": 2
"cells": [
"cell_type": "markdown",
"metadata": {},
"source": [
"### Requests\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests # 如果这里出错,证明你还没有安装这个库"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"r = requests.get('https://www.toutiao.com/') # 今日头条\n",
"print(\"查看返回状态\", r.status_code) # 200代表成功 ,404, 403, 501这些意思可以百度查一下"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 查看一下内容\n",
"print(r.text) # 返回正常的网页内容, 即解压解码之后的内容\n",
"print(r.content) # 返回byte类型的网页内容, 即值解压, 没有解码\n",
"print(r.json()) # 如果网页内容为json, 直接返回一个json对象\n",
"print(r.encoding) # 返回网页的编码: \"utf-8\"\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 网页表头信息\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from urllib.parse import urlencode\n",
"# 获取一些有意思的内容\n",
"def get_page(offset):\n",
" params = {\n",
" 'offset': offset,\n",
" 'format': 'json',\n",
" 'keyword': '搞笑',\n",
" 'autoload': 'true',\n",
" 'count': '20',\n",
" 'cur_tab': '3',\n",
" 'from': 'gallery',\n",
" }\n",
" url = 'https://www.toutiao.com/search_content/?' + urlencode(params)\n",
" try:\n",
" print(url)\n",
" response = requests.get(url)\n",
" if response.status_code == 200:\n",
" return response.json()\n",
" except requests.ConnectionError:\n",
" return None"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"contents = get_page(1)"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 分析一下结构\n",
"data = contents.get('data')\n",
"all_images = {}\n",
"if data:\n",
" for item in data:\n",
" # print(item)\n",
" image_list = item.get('image_list')\n",
" title = item.get('title')\n",
" item_id = item.get('id')\n",
" # print(image_list)\n",
" imgs = []\n",
" for image in image_list:\n",
" imgs.append(image.get('url')[2:])\n",
" \n",
" all_images[item_id] = {\n",
" 'title': title,\n",
" 'images': imgs\n",
" }\n",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 练习:保存图片 提示: os.path, 字符串处理(+http, 替换list->large, 文档操作)\n",
"# 建议使用Pycharm来写"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 补充知识\n",
"# 不同方式获取网页内容, 返回一个Response对象, 请求的参数可以为url或Request对象\n",
"r0 = requests.get(\"https://github.com/timeline.json\")\n",
"r1 = requests.post(\"http://httpbin.org/post\")\n",
"r2 = requests.put(\"http://httpbin.org/put\")\n",
"r3 = requests.delete(\"http://httpbin.org/delete\")\n",
"r4 = requests.head(\"http://httpbin.org/get\")\n",
"r5 = requests.options(\"http://httpbin.org/get\")\n",
"r6 = requests.patch(\"http://httpbin.org/get\")\n",
"# 定制请求头: 一个字典\n",
"headers = {\"user-agent\": \"my-app/0.0.1\"}\n",
"r = requests.get(\"https://api.github.com/some/endpoint\", headers=headers)\n",
"print(r.request.headers) # 获取request的头部\n",
"print(r.headers) # 获取response的头部\n",
"# 模拟一个手机的UA\n",
"# Mozilla/5.0 (Linux; Android 8.1.0; ALP-AL00 Build/HUAWEIALP-AL00; wv) \n",
"# AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/63.0.3239.83 \n",
"# Mobile Safari/537.36 T7/10.13 baiduboxapp/ (Baidu; P1 8.1.0)\n",
"# {\n",
"# \"content-encoding\": \"gzip\",\n",
"# \"transfer-encoding\": \"chunked\",\n",
"# \"connection\": \"close\",\n",
"# \"server\": \"nginx/1.0.4\",\n",
"# \"x-runtime\": \"148ms\",\n",
"# \"etag\": \"e1ca502697e5c9317743dc078f67693f\",\n",
"# \"content-type\": \"application/json\"\n",
"# }\n",
"print(r.headers[\"Content-Type\"]) # \"application/json\"\n",
"print(r.headers.get(\"content-type\")) # \"application/json\"\n",
"# 更加复杂的POST请求: 表单\n",
"post_dict = {\"key1\": \"value1\", \"key2\": \"value2\"}\n",
"r = requests.post(\"http://httpbin.org/post\", data=post_dict)\n",
"# 要想发送你的cookies到服务器, 可以使用cookies参数(一个字典)\n",
"cookies = {\"cookies_are\": \"working\"}\n",
"r = requests.get(\"http://httpbin.org/cookies\", cookies=cookies)\n",
"# 会话对象: 会话对象让你能够跨请求保持某些参数, 它也会在同一个Session实例发出的所有请求之间保持cookie\n",
"s = requests.Session()\n",
"for cookie in s.cookies:\n",
" print(cookie)\n",
"# 如果你要手动为会话添加cookie, 就是用Cookie utility函数来操纵Session.cookies\n",
"requests.utils.add_dict_to_cookiejar(s.cookies, {\"cookie_key\": \"cookie_value\"})\n",
"# 会话也可用来为请求方法提供缺省数据, 这是通过为会话对象的属性提供数据来实现的\n",
"s.auth = (\"user\", \"pass\")\n",
"s.headers.update({\"x-test\": \"true\"})\n",
"s.get(\"http://httpbin.org/headers\", headers={\"x-test2\": \"true\"})\n",
"# both \"x-test\" and \"x-test2\" are sent"
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"nbformat": 4,
"nbformat_minor": 2