Commit ba83d019, authored by 懂一点的陈老师

update all chapter

Parent afa9394d
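# Python Basics and Introduction to Data Analysis
## Python tutorial
|Chapter|File|
|-|-|
|Basic concepts|chapter_1.ipynb|
|Common data types|chapter_2_structure.ipynb|
|String handling and formatted output|chapter_3_string_format.ipynb|
|Dictionaries, lists, and tuples|chapter_4_dict_list_tuple.ipynb|
|Functions|chapter_5_function.ipynb|
|Objects|chapter_6_oop.ipynb|
|File I/O|chapter_7_IO_exec.ipynb|
|Commonly used modules|chapter_8_package.ipynb|
|Introduction to web scraping|chapter_9_Requests.ipynb|
|Database connections|chapter_10_DB.ipynb|
|Web applications|chapter_11_Web.ipynb|
|Introduction to NumPy|chapter_12_Numpy.ipynb|
|Introduction to Pandas|chapter_13_Pandas.ipynb|
|Data visualization|chapter_14_Vis.ipynb|
|Data analysis example|chapter_14_Vis_example.ipynb|
|Parallel computing|chapter_15_thread_multiprocee.ipynb|
|Regular expressions|chapter_16_regexp.ipynb|
* Chapter 1: Basic concepts
* Chapter 2: Common data types
* Chapter 3: String handling and formatted output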
[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/liangxuCHEN/A2Z_python.git/master)
......
......@@ -13,17 +13,9 @@
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": null,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"1 100 -80 0\n"
]
}
],
"outputs": [],
"source": [
"a1 = 1\n",
"a2 = 100\n",
......@@ -438,7 +430,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
"version": "3.6.1"
}
},
"nbformat": 4,
......
{
"cells": [
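{
"cell_type": "markdown",
"metadata": {},
"source": [
"### The SQLite database\n",
"\n",
"SQLite is an embedded database: the whole database is a single file. Because SQLite is written in C and is very small, it is often built into other applications, including iOS and Android apps.\n",
"\n",
"On Windows, install it from the [download page](https://www.sqlite.org/download.html).\n",
"\n",
"## Using SQLite\n",
"Python ships with the sqlite3 module, so using SQLite from Python requires no extra installation.\n",
"\n",
"Before using SQLite, a few concepts need to be clear:\n",
"\n",
"A table is a collection of relational data stored in a database. A database usually contains several tables, such as a student table, a class table, and a school table. Tables are linked to one another through foreign keys.\n",
"\n",
"To work with a relational database, you first connect to it; a database connection is called a Connection.\n",
"\n",
"Once connected, you open a cursor, called a Cursor, use it to execute SQL statements, and then read the results.\n",
"\n",
"Python defines a standard database API (DB-API 2.0, PEP 249). To be used from Python, a database only needs a driver that conforms to this standard.\n",
"\n",
"Because the SQLite driver is part of the Python standard library, we can work with SQLite databases directly.\n",
"\n",
"Let us try it out interactively:"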
]
},
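{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Import os (to remove a stale file) and the SQLite driver:\n",
"import os\n",
"import sqlite3\n",
"\n",
"# Delete the database file if it already exists,\n",
"# so that the example starts from a clean state\n",
"db_file = 'test.db'\n",
"if os.path.isfile(db_file):\n",
"    os.remove(db_file)\n",
"\n",
"# Connect to the SQLite database stored in test.db.\n",
"# If the file does not exist, it is created in the current directory:\n",
"conn = sqlite3.connect(db_file)"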
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a Cursor:\n",
"cursor = conn.cursor()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Execute a SQL statement to create the user table:\n",
"cursor.execute('create table user (id varchar(20) primary key, name varchar(20))')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Execute another SQL statement to insert one record:\n",
"cursor.execute(\"insert into user (id, name) values ('5', 'Mike')\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# rowcount gives the number of rows inserted:\n",
"print(\"rows inserted:\", cursor.rowcount)\n",
"# Close the Cursor:\n",
"cursor.close()\n",
"# Commit the transaction:\n",
"conn.commit()\n",
"# Close the Connection:\n",
"conn.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query the records\n",
"conn = sqlite3.connect('test.db')\n",
"cursor = conn.cursor()\n",
"# Execute the query:\n",
"cursor.execute('select * from user')\n",
"# Fetch the query result set:\n",
"print(cursor.fetchall())\n",
"cursor.close()\n",
"conn.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise\n",
"import os, sqlite3\n",
"\n",
"db_file = 'exec.db'\n",
"if os.path.isfile(db_file):\n",
"    os.remove(db_file)\n",
"\n",
"# Create the table\n",
"conn = sqlite3.connect(db_file)\n",
"cursor = conn.cursor()\n",
"cursor.execute('create table student (id varchar(20) primary key, name varchar(20), score int)')\n",
"\n",
"# Insert a row\n",
"cursor.execute(\"insert into student values ('A-001', 'Adam', 95)\")\n",
"# Insert two more rows\n",
"cursor.execute(???)\n",
"\n",
"cursor.close()\n",
"conn.commit()\n",
"conn.close()\n",
"\n",
"\n",
"def get_all():\n",
"    # Connect to the SQLite database\n",
"    # Execute a SQL statement\n",
"    # Fetch the query result set\n",
"    ???\n",
"    print(all_value)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## SQLAlchemy\n",
"\n",
"SQLAlchemy is an ORM (object-relational mapping) toolkit for Python. It maps between database tables and Python objects, covers the vast majority of database operations, and supports many relational database engines (SQLite, MySQL, PostgreSQL, Microsoft SQL Server, Oracle, and others).\n",
"\n",
"The first step is to connect to a database. Each database engine has its own connection string; the general form is\n",
"\n",
"```\n",
"dialect+driver://username:password@host:port/database\n",
"```\n",
"\n",
"```\n",
"mssql+pymssql://username:password@hostname/database\n",
"mssql://username:password@hostname/database\n",
"postgresql://username:password@hostname/database\n",
"sqlite:////absolute/path/to/database\n",
"sqlite:///c:/absolute/path/to/database\n",
"```\n",
"For more on connection strings, see the [SQLAlchemy engine documentation](https://docs.sqlalchemy.org/en/latest/core/engines.html?highlight=create_engine#database-urls).\n",
"\n",
"Below is an example of connecting to and using a SQLite database."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### connection\n",
"Connect to the database and work with it through a plain connection."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy import create_engine"
]
},
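{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy import text\n",
"\n",
"# Database connection string\n",
"DB_CONNECT_STRING = 'sqlite:///test.db'\n",
"\n",
"# Create the engine; with echo=True every SQL statement is printed\n",
"engine = create_engine(DB_CONNECT_STRING, echo=True)\n",
"\n",
"# Open a connection; it is used much like Python's built-in sqlite3 module\n",
"with engine.connect() as con:\n",
"    # text() wraps a raw SQL string; SQLAlchemy 2.0 requires it.\n",
"    # In SQLAlchemy 1.x, INSERT/UPDATE/DELETE took effect immediately (autocommit);\n",
"    # in 2.0 they need an explicit con.commit().\n",
"    rs = con.execute(text('select * from user'))\n",
"    data = rs.fetchone()\n",
"    print(\"Data: %s\" % (data,))"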
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### connection transactions\n",
"A transaction groups several statements so that they are committed or rolled back together."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy import text\n",
"\n",
"DB_CONNECT_STRING = 'sqlite:///test.db'\n",
"engine = create_engine(DB_CONNECT_STRING)\n",
"\n",
"with engine.connect() as connection:\n",
"    trans = connection.begin()\n",
"    try:\n",
"        r0 = connection.execute(text(\"create table book (id varchar(20) primary key, name varchar(20), user_id varchar(20))\"))\n",
"    except Exception:\n",
"        print(\"The table already exists, skipping creation...\")\n",
"    try:\n",
"        r1 = connection.execute(text(\"insert into book (id, name, user_id) values ('3', 'Lucxx', '2')\"))\n",
"        r2 = connection.execute(text(\"select * from book\"))\n",
"        # Read the rows first, then commit\n",
"        print(r2.fetchall())\n",
"        trans.commit()\n",
"    except Exception:\n",
"        # Undo the insert if anything failed, then re-raise the error\n",
"        trans.rollback()\n",
"        raise"
]
},
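{
"cell_type": "markdown",
"metadata": {},
"source": [
"## session\n",
"A connection is the ordinary way to use a database. SQLAlchemy offers a second way, the session object. A session records and tracks changes to data, commits them at the appropriate time, and supports SQLAlchemy's ORM features. Basic usage is shown below."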
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy.orm import sessionmaker"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy import text\n",
"\n",
"# Typical usage pattern\n",
"# Database connection string\n",
"DB_CONNECT_STRING = 'sqlite:///test.db'\n",
"\n",
"# Create the engine; with echo=True every SQL statement is printed\n",
"engine = create_engine(DB_CONNECT_STRING, echo=True)\n",
"\n",
"# Create the session class\n",
"DB_Session = sessionmaker(bind=engine)\n",
"\n",
"# Create a session object\n",
"session = DB_Session()\n",
"\n",
"# Do the database work inside the session:\n",
"# create tables, fetch rows, insert rows, update rows\n",
"\n",
"# The member table must exist before the insert below.\n",
"# Uncomment this line on the first run only:\n",
"#session.execute(text(\"create table member (id varchar(20) primary key, name varchar(20))\"))\n",
"\n",
"session.execute(text(\"insert into member(id, name) values('3', '小样')\"))\n",
"session.commit()  # persist the inserted and modified rows\n",
"\n",
"# Close the session when done; a with block also works (SQLAlchemy 1.4+)\n",
"session.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above creates a session object and uses it to work with the database. As the insert shows, a session can also execute plain SQL statements."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise: connect to a database, create a student table (Student) with the columns id, name, age, and insert one row: (1, Tom, 19)\n",
"\n",
"from sqlalchemy import create_engine, text\n",
"\n",
"# Database connection settings\n",
"DB_CONNECT_STRING = ???\n",
"engine = create_engine(DB_CONNECT_STRING, echo=True)\n",
"\n",
"with engine.connect() as connection:\n",
"    trans = ???\n",
"    ???:\n",
"        r0 = connection.???(text(\"create table Student (id varchar(20) primary key, name varchar(20), age int)\"))\n",
"    ???:\n",
"        print(\"The table already exists, skipping creation...\")\n",
"    try:\n",
"        r1 = connection.???(text(\"insert into Student (id,name,age) values ('1', 'Tom', 19)\"))\n",
"        trans.???\n",
"    except:\n",
"        trans.rollback()\n",
"        raise"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### ORM\n",
"\n",
"Object-Relational Mapping (ORM) maps the table structure of a relational database onto objects.\n",
"\n",
"A row of the member table above can be represented by an instance of a class:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"class Member(object):\n",
"    def __init__(self, id, name):\n",
"        self.id = id\n",
"        self.name = name"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Step 1: import SQLAlchemy and initialize DBSession:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Imports:\n",
"from sqlalchemy import Column, String, create_engine\n",
"from sqlalchemy.orm import sessionmaker\n",
"from sqlalchemy.ext.declarative import declarative_base"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create the base class for mapped objects:\n",
"Base = declarative_base()\n",
"\n",
"# Database connection string\n",
"DB_CONNECT_STRING = 'sqlite:///db.sqlite'\n",
"\n",
"# Define the User object:\n",
"class User(Base):\n",
"    # Table name:\n",
"    __tablename__ = 'user'\n",
"\n",
"    # Table structure:\n",
"    id = Column(String(20), primary_key=True)\n",
"    name = Column(String(20))\n",
"\n",
"# Initialize the database connection:\n",
"engine = create_engine(DB_CONNECT_STRING)\n",
"# Create the DBSession class:\n",
"DBSession = sessionmaker(bind=engine)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, let us see how to add a row to a database table.\n",
"\n",
"With the ORM, adding a row to the table amounts to adding a User object:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a session object:\n",
"session = DBSession()\n",
"# Create a new User object:\n",
"new_user = User(id='2', name='Bob')\n",
"# Add it to the session:\n",
"session.add(new_user)\n",
"# Commit to save it to the database:\n",
"session.commit()\n",
"# Close the session:\n",
"session.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Query data from the database table\n",
"# Create a session:\n",
"session = DBSession()\n",
"# Build a Query; filter is the WHERE condition. one() returns the single matching row\n",
"# (and raises if there is not exactly one); all() returns every matching row:\n",
"user = session.query(User).filter(User.id=='1').one()\n",
"# Print the type and the name attribute of the object:\n",
"print('type:', type(user))\n",
"print('name:', user.name)\n",
"# Close the session:\n",
"session.close()"
]
},
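{
"cell_type": "markdown",
"metadata": {},
"source": [
"As this shows, an ORM associates the rows of a database table with objects and converts between the two.\n",
"\n",
"Because tables in a relational database can be linked by foreign keys into one-to-many, many-to-many, and other relationships, an ORM framework can likewise provide one-to-many and many-to-many relationships between objects.\n",
"\n",
"For example, if one User owns several Books, the one-to-many relationship can be defined as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sqlalchemy import ForeignKey\n",
"# Define a class for books\n",
"class Book(Base):\n",
"    __tablename__ = 'book'\n",
"\n",
"    id = Column(String(20), primary_key=True)\n",
"    name = Column(String(20))\n",
"    # The book table, the \"many\" side, refers to the user table through a foreign key:\n",
"    user_id = Column(String(20), ForeignKey('user.id'))"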
]
},
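{
"cell_type": "markdown",
"metadata": {},
"source": [
"If the User class additionally declares `books = relationship('Book')` (with relationship imported from sqlalchemy.orm), the books attribute of a queried User object returns a list of its Book objects. The classes above define only the foreign key, so the cell below links a book to a user through user_id."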
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Create a session object:\n",
"session = DBSession()\n",
"# Create a new User object:\n",
"new_user = User(id='21', name='Kerry')\n",
"# Add it to the session:\n",
"session.add(new_user)\n",
"new_book = Book(id='10', name='Learn Python', user_id = new_user.id)\n",
"print('Book: %s, user: %s' % (new_book.name, new_user.name))\n",
"session.add(new_book)\n",
"# Commit to save both objects to the database:\n",
"session.commit()\n",
"# Close the session:\n",
"session.close()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise 1: query book data from the database table\n",
"# Create a session:\n",
"session = DBSession()\n",
"# Build a Query; filter is the WHERE condition. one() returns the single matching row; all() returns every matching row:\n",
"book = session.query(???).filter(???.id=='3').one()\n",
"# Print the result\n",
"print('book id', ???)\n",
"print('book name:', ???)\n",
"# Close the session:\n",
"session.close()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise\n",
"db_test.py\n",
"\n",
"### Summary\n",
"db_example.py"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
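The SQLite cells in the notebook above write the values directly into the SQL string. A minimal sketch of the same connect, cursor, execute, commit flow with `?` placeholders follows; the in-memory database (`:memory:`) and the second sample row are assumptions made for illustration and are not part of the original notebook.

```python
import sqlite3

# ":memory:" keeps the database in RAM, so no file is created (illustration only).
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("create table user (id varchar(20) primary key, name varchar(20))")

# Each "?" is filled in by the driver from the matching tuple element,
# so values are never spliced into the SQL text by hand.
rows = [("5", "Mike"), ("6", "Anna")]  # sample data, assumed for this sketch
cursor.executemany("insert into user (id, name) values (?, ?)", rows)
conn.commit()

# The same placeholder style works for queries.
cursor.execute("select name from user where id = ?", ("5",))
result = cursor.fetchall()
print(result)

cursor.close()
conn.close()
```

Placeholders are usually preferred whenever a value comes from user input, because the driver handles quoting.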
{
"cells": [
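{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Introduction to web applications\n",
"\n",
"![web application](http://intelhunt.com/images/Web-Application-Development1.jpg)\n",
"\n",
"A web application is a program that is accessed over the Web. Its main advantage is ease of access: users need only a browser and do not have to install any other software.\n",
"\n",
"HTML is the text format that defines a web page; if you know HTML, you can write web pages.\n",
"\n",
"HTTP is the protocol that carries HTML over the network; browsers and servers use it to communicate.\n",
"\n",
"For debugging web applications, the <strong>Chrome browser</strong> is recommended.\n",
"\n",
"### HTML\n",
"\n",
"An HTML document is a series of tags, with \\<html\\> as the outermost one. Well-formed HTML also contains \\<head\\>...\\</head\\> and \\<body\\>...\\</body\\> (not to be confused with the HTTP header and body). Because HTML is a rich document model, it has further tags for links, images, tables, forms, and so on.\n",
" \n",
"### CSS\n",
"\n",
"CSS stands for Cascading Style Sheets. CSS controls how every element in an HTML page is displayed.\n",
"\n",
"### A brief look at JavaScript\n",
"\n",
"Despite the name, JavaScript is unrelated to Java. JavaScript was added as a scripting language to make HTML interactive. It can be embedded in an HTML page or linked from an external file. For example, turning a heading red when the user clicks it has to be done with JavaScript.\n",
"\n",
"To learn web development, you first need some understanding of HTML, CSS, and JavaScript. HTML defines the content of a page, CSS controls the style of the page elements, and JavaScript handles the interaction logic.\n",
"\n",
"Covering HTML, CSS, and JavaScript could fill three books. A good web developer needs to master all three. A recommended online learning site is W3School:\n",
"\n",
"http://www.w3school.com.cn/\n",
"\n",
"When we build a web application in Python or another language, we generate HTML dynamically on the server, so that the browser shows different pages to different users.\n",
"\n",
"### Web application examples\n",
"\n",
"![application](https://daproim.com/wp-content/uploads/2017/09/webdesign-1.png)\n",
"\n",
"\n",
"### Static and dynamic websites\n",
"\n",
"1. http://www.wuzhen.com.cn/\n",
"2. https://similar.ai/\n",
"3. https://www.taobao.com/\n",
"4. https://www.toutiao.com/"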
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Write a server.py that starts a WSGI (Web Server Gateway Interface) server and loads the application() function:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# server.py\n",
"# Import from the wsgiref module:\n",
"from wsgiref.simple_server import make_server\n",
"# The application function could also be imported from a separate module:\n",
"# from hello import application\n",
"\n",
"def application(environ, start_response):\n",
"    start_response('200 OK', [('Content-Type', 'text/html')])\n",
"    return [b'<h1>Hello, web!</h1>']\n",
"\n",
"# Create a server: empty host (all interfaces), port 8000, handler function application:\n",
"httpd = make_server('', 8000, application)\n",
"print('Serving HTTP on port 8000...')\n",
"# Start listening for HTTP requests (this call blocks until interrupted):\n",
"httpd.serve_forever()"
]
},
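{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"## Exercise\n",
"\n",
"### Start a server, following the example above"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The Flask framework\n",
"\n",
"The WSGI interface is considerably higher-level than raw HTTP, but it is still low-level compared with the processing logic of a web app. We need a further abstraction on top of WSGI so that we can concentrate on handling one URL with one function and leave the mapping from URLs to functions to a web framework.\n",
"\n",
"Writing a web framework in Python is quite easy, which is why Python has hundreds of open-source web frameworks. We will not compare their strengths and weaknesses here and simply pick a popular one, Flask."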
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Another Hello example\n",
"from flask import Flask\n",
"app = Flask(__name__)\n",
"\n",
"@app.route('/')\n",
"def hello_world():\n",
"    return 'Hello, World!'"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For more, see the [Flask quickstart](http://docs.jinkan.org/docs/flask/quickstart.html)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise:\n",
"\n",
"First read flask_post.py and flask_example.py.\n",
"\n",
"Then read flask_api.py and complete the exercises in it; they need db_test.py and the db.sqlite database from earlier."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
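The WSGI cell in the notebook above defines `application(environ, start_response)` and hands it to a server. Because a WSGI application is just a callable, it can also be invoked directly without opening a socket, which makes the contract easier to see. The sketch below assumes a variant of `application` that greets the name taken from the URL path; `call_app` is a hypothetical helper written for this illustration, not part of wsgiref.

```python
from wsgiref.util import setup_testing_defaults


def application(environ, start_response):
    # PATH_INFO holds the path part of the requested URL, e.g. "/Bob".
    name = environ.get("PATH_INFO", "/").strip("/") or "web"
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    # A WSGI app returns an iterable of byte strings.
    return [("<h1>Hello, %s!</h1>" % name).encode("utf-8")]


def call_app(app, path):
    """Invoke a WSGI app once and return (status, body). Hypothetical helper."""
    environ = {"PATH_INFO": path}
    setup_testing_defaults(environ)  # fills in the other required WSGI keys
    captured = {}

    def start_response(status, headers):
        # The server normally supplies this callback; here we just record the call.
        captured["status"] = status
        captured["headers"] = headers

    body = b"".join(app(environ, start_response))
    return captured["status"], body


status, body = call_app(application, "/Bob")
print(status, body)
```

A web framework such as Flask performs the same kind of dispatch on `PATH_INFO`, but maps each URL to its own function.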
{
"cells": [
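{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Matplotlib [official site](https://matplotlib.org/users/pyplot_tutorial.html)\n",
"\n",
"Matplotlib is designed to produce powerful visualizations in a simple way, and it is one of the core libraries to learn in Python.\n",
"\n",
"It is a library for drawing 2D plots of arrays in Python.\n",
"\n",
"Conceptually, the matplotlib code is divided into three parts:\n",
"\n",
"1. The pylab interface is the set of functions provided by matplotlib.pylab, which lets users create plots with code very similar to MATLAB figure-generating code.\n",
"\n",
"2. The matplotlib frontend, or API, is the set of classes that create and manage figures, text, lines, plots, and so on (see the Artist tutorial). It is an abstract interface that knows nothing about output.\n",
"\n",
"3. The backends are device-dependent drawing devices, also called renderers, that turn the frontend representation into hardcopy or a display device. Example backends: PS creates PostScript® hardcopy, SVG creates Scalable Vector Graphics hardcopy, Agg creates PNG output using the high-quality Anti-Grain Geometry library that ships with Matplotlib, GTK embeds Matplotlib in a Gtk+ application, GTKAgg uses the Anti-Grain renderer to create a figure and embed it in a Gtk+ application, and so on for PDF, WxWidgets, Tkinter, etc.\n"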
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import matplotlib\n",
"import matplotlib.mlab as mlab\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Example 1\n",
"\n",
"def simple_plot():\n",
"    # Generate test data\n",
"    x = np.linspace(-np.pi, np.pi, 256, endpoint=True)\n",
"    y_cos, y_sin = np.cos(x), np.sin(x)\n",
"\n",
"    # Create the figure and set its title\n",
"    # figsize is the size in inches, dpi the resolution\n",
"    plt.figure(figsize=(8, 6), dpi=80)\n",
"    plt.title(\"Simple plot\")\n",
"    plt.grid(True)  # show grid lines\n",
"\n",
"    # Configure the x-axis\n",
"    plt.xlabel(\"X\")\n",
"    plt.xlim(-4.0, 4.0)\n",
"    plt.xticks(np.linspace(-4, 4, 9, endpoint=True))\n",
"\n",
"    # Configure the y-axis\n",
"    plt.ylabel(\"Y\")\n",
"    plt.ylim(-1.0, 1.0)\n",
"    plt.yticks(np.linspace(-1, 1, 9, endpoint=True))\n",
"\n",
"    # Draw the two curves\n",
"    plt.plot(x, y_cos, \"b--\", linewidth=2.0, label=\"cos\")\n",
"    plt.plot(x, y_sin, \"g-\", linewidth=2.0, label=\"sin\")\n",
"\n",
"    # Place the legend; loc combines upper/lower/center with left/right/center\n",
"    plt.legend(loc=\"upper left\", shadow=True)\n",
"\n",
"    # Show the figure\n",
"    plt.show()\n",
"    return"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"simple_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# A slightly more complex example\n",
"def simple_advanced_plot():\n",
"    \"\"\"\n",
"    simple advanced plot\n",
"    \"\"\"\n",
"    # Generate test data\n",
"    x = np.linspace(-np.pi, np.pi, 256, endpoint=True)\n",
"    y_cos, y_sin = np.cos(x), np.sin(x)\n",
"\n",
"    # Create the figure and set its title\n",
"    plt.figure(figsize=(8, 6), dpi=80)\n",
"    plt.title(\"simple advanced plot\")\n",
"    plt.grid(True)\n",
"\n",
"    # Another way to plot: through an Axes object\n",
"    ax_1 = plt.subplot(111)  # can also be written plt.subplot(1, 1, 1)\n",
"    ax_1.plot(x, y_cos, color=\"blue\", linewidth=2.0, linestyle=\"--\", label=\"left cos\")\n",
"    ax_1.legend(loc=\"upper left\", shadow=True)\n",
"\n",
"    # Configure the left y-axis\n",
"    ax_1.set_ylabel(\"left cos y\")\n",
"    ax_1.set_ylim(-1.0, 1.0)\n",
"    ax_1.set_yticks(np.linspace(-1, 1, 9, endpoint=True))\n",
"\n",
"    # Add a second y-axis that shares the same x-axis\n",
"    ax_2 = ax_1.twinx()\n",
"    ax_2.plot(x, y_sin, color=\"green\", linewidth=2.0, linestyle=\"-\", label=\"right sin\")\n",
"    ax_2.legend(loc=\"upper right\", shadow=True)\n",
"\n",
"    # Configure the right y-axis\n",
"    ax_2.set_ylabel(\"right sin y\")\n",
"    ax_2.set_ylim(-2.0, 2.0)\n",
"    ax_2.set_yticks(np.linspace(-2, 2, 9, endpoint=True))\n",
"\n",
"    # Configure the shared x-axis\n",
"    ax_1.set_xlabel(\"x\")\n",
"    ax_1.set_xlim(-4.0, 4.0)\n",
"    ax_1.set_xticks(np.linspace(-4, 4, 9, endpoint=True))\n",
"\n",
"    # Show the figure\n",
"    plt.show()\n",
"    return"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"simple_advanced_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Exercise: change the line colors and styles in the plot above, e.g. colors red, yellow; line style -."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Draw several plots at once\n",
"def subplot_plot():\n",
"    \"\"\"\n",
"    subplot plot\n",
"    \"\"\"\n",
"    # Style string for each subplot\n",
"    style_list = [\"g+-\", \"r*-\", \"b.-\", \"yo-\"]\n",
"\n",
"    plt.figure(figsize=(8, 6), dpi=80)\n",
"\n",
"    # Draw the subplots one by one\n",
"    for num in range(4):\n",
"        # Generate test data\n",
"        x = np.linspace(0.0, 2+num, num=10*(num+1))\n",
"        y = np.sin((5-num) * np.pi * x)\n",
"\n",
"        # Select cell num+1 of a 2x2 grid\n",
"        plt.subplot(2, 2, num+1)\n",
"        plt.title(\"sub plot %d\" % (num+1))\n",
"        plt.plot(x, y, style_list[num])\n",
"\n",
"    # Show the figure\n",
"    plt.show()\n",
"    return"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"subplot_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Bar chart\n",
"def bar_plot():\n",
"    \"\"\"\n",
"    bar plot\n",
"    \"\"\"\n",
"    # Generate test data\n",
"    means_men = (20, 35, 30, 35, 27)\n",
"    means_women = (25, 32, 34, 20, 25)\n",
"\n",
"    # Set the title\n",
"    plt.title(\"bar plot\")\n",
"\n",
"    # Bar positions and width\n",
"    index = np.arange(len(means_men))\n",
"    bar_width = 0.35\n",
"\n",
"    # Draw the two groups of bars side by side\n",
"    plt.bar(index, means_men, width=bar_width, alpha=0.2, color=\"b\", label=\"Men\")\n",
"    plt.bar(index+bar_width, means_women, width=bar_width, alpha=0.8, color=\"r\", label=\"Women\")\n",
"    plt.legend(loc=\"upper right\", shadow=True)\n",
"\n",
"    # Write each value above its bar\n",
"    for x, y in zip(index, means_men):\n",
"        plt.text(x, y+0.3, y, ha=\"center\", va=\"bottom\")\n",
"    for x, y in zip(index, means_women):\n",
"        plt.text(x+bar_width, y+0.3, y, ha=\"center\", va=\"bottom\")\n",
"\n",
"    # Set the axis range, axis labels, and tick labels\n",
"    plt.ylim(0, 45)\n",
"    plt.xlabel(\"Group\")\n",
"    plt.ylabel(\"Scores\")\n",
"    plt.xticks(index+(bar_width/2), (\"A\", \"B\", \"C\", \"D\", \"E\"))\n",
"\n",
"    # Show the figure\n",
"    plt.show()\n",
"    return"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bar_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Horizontal bar chart\n",
"def barh_plot():\n",
"    \"\"\"\n",
"    barh plot\n",
"    \"\"\"\n",
"    # Generate test data\n",
"    means_men = (20, 35, 30, 35, 27)\n",
"    means_women = (25, 32, 34, 20, 25)\n",
"\n",
"    # Set the title\n",
"    plt.title(\"barh plot\")\n",
"\n",
"    # Bar positions and height\n",
"    index = np.arange(len(means_men))\n",
"    bar_height = 0.35\n",
"\n",
"    # Draw the bars horizontally\n",
"    plt.barh(index, means_men, height=bar_height, alpha=0.2, color=\"b\", label=\"Men\")\n",
"    plt.barh(index+bar_height, means_women, height=bar_height, alpha=0.8, color=\"r\", label=\"Women\")\n",
"    plt.legend(loc=\"upper right\", shadow=True)\n",
"\n",
"    # Write each value to the right of its bar\n",
"    for x, y in zip(index, means_men):\n",
"        plt.text(y+0.3, x, y, ha=\"left\", va=\"center\")\n",
"    for x, y in zip(index, means_women):\n",
"        plt.text(y+0.3, x+bar_height, y, ha=\"left\", va=\"center\")\n",
"\n",
"    # Set the axis range, axis labels, and tick labels\n",
"    plt.xlim(0, 45)\n",
"    plt.xlabel(\"Scores\")\n",
"    plt.ylabel(\"Group\")\n",
"    plt.yticks(index+(bar_height/2), (\"A\", \"B\", \"C\", \"D\", \"E\"))\n",
"\n",
"    # Show the figure\n",
"    plt.show()\n",
"    return"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"barh_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Stacked bar chart\n",
"def table_plot():\n",
"    \"\"\"\n",
"    table plot\n",
"    \"\"\"\n",
"    # Generate test data\n",
"    data = np.array([\n",
"        [1, 4, 2, 5, 2],\n",
"        [2, 1, 1, 3, 6],\n",
"        [5, 3, 6, 4, 1]\n",
"    ])\n",
"\n",
"    # Set the title\n",
"    plt.title(\"table plot\")\n",
"\n",
"    # Bar positions and one color per row\n",
"    index = np.arange(len(data[0]))\n",
"    color_index = [\"r\", \"g\", \"b\"]\n",
"\n",
"    # Baseline of each column; starts at zero\n",
"    bottom = np.array([0, 0, 0, 0, 0])\n",
"\n",
"    # Draw each row on top of the previous ones, then raise the baseline\n",
"    for i in range(len(data)):\n",
"        plt.bar(index, data[i], width=0.5, color=color_index[i], bottom=bottom, alpha=0.7, label=\"label %d\" % i)\n",
"        bottom += data[i]\n",
"\n",
"    # Place the legend\n",
"    plt.legend(loc=\"upper left\", shadow=True)\n",
"\n",
"    # Show the figure\n",
"    plt.show()\n",
"    return"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"table_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Pie chart\n",
"def pie_plot():\n",
"    \"\"\"\n",
"    pie plot\n",
"    \"\"\"\n",
"    # Generate test data\n",
"    sizes = [15, 30, 45, 10]\n",
"    labels = [\"Frogs\", \"Cat\", \"Dogs\", \"Logs\"]\n",
"    colors = [\"yellowgreen\", \"gold\", \"lightskyblue\", \"lightcoral\"]\n",
"\n",
"    # Set the title\n",
"    plt.title(\"pie\")\n",
"\n",
"    # How far each wedge is pulled out of the pie\n",
"    explode = [0, 0.05, 0, 0]\n",
"\n",
"    # Draw the pie chart\n",
"    patches, l_text, p_text = plt.pie(sizes, explode=explode, labels=labels, colors=colors, autopct=\"%1.1f%%\", shadow=True, startangle=90)\n",
"\n",
"    # Equal aspect ratio keeps the pie circular\n",
"    plt.axis(\"equal\")\n",
"\n",
"    # Show the figure\n",
"    plt.show()\n",
"    return\n",
"# pie_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pie_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Scatter plot\n",
"def scatter_plot():\n",
"    \"\"\"\n",
"    scatter plot\n",
"    \"\"\"\n",
"    # Generate test data\n",
"    point_count = 1000\n",
"    x_index = np.random.random(point_count)\n",
"    y_index = np.random.random(point_count)\n",
"\n",
"    # Set the title\n",
"    plt.title(\"scatter\")\n",
"\n",
"    # Random color and size for each point\n",
"    color_list = np.random.random(point_count)\n",
"    scale_list = np.random.random(point_count) * 100\n",
"\n",
"    # Draw the scatter plot\n",
"    plt.scatter(x_index, y_index, s=scale_list, c=color_list, marker=\"o\")\n",
"\n",
"    # Show the figure\n",
"    plt.show()\n",
"    return\n",
"# scatter_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"scatter_plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Radar chart\n",
"def radar_plot():\n",
"    \"\"\"\n",
"    radar plot\n",
"    \"\"\"\n",
"    # Generate test data\n",
"    labels = np.array([\"A\", \"B\", \"C\", \"D\", \"E\", \"F\"])\n",
"    data = np.array([68, 83, 90, 77, 89, 73])\n",
"    theta = np.linspace(0, 2*np.pi, len(data), endpoint=False)\n",
"\n",
"    # Preprocess: repeat the first point at the end to close the polygon\n",
"    data = np.concatenate((data, [data[0]]))\n",
"    theta = np.concatenate((theta, [theta[0]]))\n",
"\n",
"    # Use a polar subplot\n",
"    plt.subplot(111, polar=True)\n",
"    plt.title(\"radar\")\n",
"\n",
"    # Set the theta grid and the radial grid;\n",
"    # theta[:-1] drops the repeated point so that angles and labels match in number\n",
"    plt.thetagrids(theta[:-1]*(180/np.pi), labels=labels)\n",
"    plt.rgrids(np.arange(20, 100, 20), labels=np.arange(20, 100, 20), angle=0)\n",
"    plt.ylim(0, 100)\n",
"\n",
"    # Draw the outline and fill the enclosed area\n",
"    plt.plot(theta, data, \"bo-\", linewidth=2)\n",
"    plt.fill(theta, data, color=\"red\", alpha=0.25)\n",
"\n",
"    # Save the figure to a file\n",
"    plt.savefig('radar.png')\n",
"    # Show the figure\n",
"    plt.show()\n",
"    return"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"radar_plot()\n"
]
},
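{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Exercise\n",
"# Test data: average scores of 5 exams; row 1 is class 1, row 2 is class 2. Pick a theme of your own and draw a bar chart and a pie chart, e.g. comparing the 5 results of class 1 and class 2.\n",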
"data = np.array([\n",
" [80, 84, 92, 100, 62],\n",
" [60, 100, 100, 93, 86],\n",
"])\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Plotting from Pandas\n",
"\n",
"[Parameter reference](https://blog.csdn.net/claroja/article/details/73872066?utm_source=debugrun&utm_medium=referral)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"x = np.linspace(0, 2*np.pi, 100)  # 100 evenly spaced points from 0 to 2π\n",
"df = pd.DataFrame(data={'sin': np.sin(x), 'cos': np.cos(x)}, index=x)  # build the DataFrame\n",
"print(df.head())\n",
"df.plot()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df.plot(title='title', fontsize=20, figsize=(8, 6), grid=True)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"s = pd.Series(data=np.random.randint(0,10,size=5),index=list('abcde')) \n",
"s.plot(kind='barh')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = pd.DataFrame(np.random.randint(0,150,size=(20,3)),columns=['python','math','eng'])\n",
"df.plot(kind='scatter',x='python',y='eng')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df['php'] = df['python'].map(lambda x: x*0.9+np.random.randint(-10,10,1)[0])\n",
"df.plot(kind='scatter',x='python',y='php')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Exercise: load the data, then draw a bar chart and a scatter plot from the id and price columns\n",
"wz_df = pd.read_csv('datas/waizi_v2.csv')\n",
"wz_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"wz_df = wz_df[wz_df['type']=='合同外资金额']\n",
"wz_df.plot(kind='barh',x='id',y='price')\n",
"plt.show()\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
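The stacked bar chart in `table_plot()` in the notebook above relies on a running baseline: each row is drawn starting from the sum of all rows below it. The following sketch reproduces only that bookkeeping in plain Python, without plotting; `stacked_bottoms` is a hypothetical helper introduced for illustration, and the data is the same 3x5 array used in `table_plot()`.

```python
def stacked_bottoms(rows):
    """Return the baseline of every row plus the final column totals."""
    bottoms = []
    running = [0] * len(rows[0])
    for row in rows:
        # Record the baseline before this row is stacked on top of it.
        bottoms.append(list(running))
        running = [base + value for base, value in zip(running, row)]
    return bottoms, running


data = [
    [1, 4, 2, 5, 2],
    [2, 1, 1, 3, 6],
    [5, 3, 6, 4, 1],
]

bottoms, totals = stacked_bottoms(data)
for row, base in zip(data, bottoms):
    print(row, "starts at", base)
print("column totals:", totals)
```

Each entry of `bottoms` corresponds to the value passed as `bottom=` for that row in `plt.bar`, and `totals` gives the final height of every column.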
{
"cells": [
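{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Seaborn\n",
"\n",
"What is the difference between plotting with Pandas and plotting with Seaborn?\n",
"\n",
"Both draw their plots with matplotlib, but their designs differ considerably:\n",
"\n",
"1. For simple plots, use Pandas directly; for more attractive and richer plots, use Seaborn.\n",
"2. The Pandas plotting functions have few parameters for adjusting a figure, so you need a deeper understanding of matplotlib.\n",
"3. The Seaborn plotting functions provide many parameters for adjusting a figure, so a deep understanding of matplotlib is not required.\n",
"4. Seaborn API reference: https://stanford.edu/~mwaskom/software/seaborn/api.html#style-frontend"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Titanic data analysis\n",
"\n",
"The sinking of the Titanic is one of the most famous shipwrecks in history: many passengers died in the disaster and some were rescued. The dataset here describes a group of passengers (name, age, sex, fare, and so on) together with whether each one survived. The task is to build a model from it and predict whether another group of passengers survived. Let us take a look.\n",
"\n",
"## Getting an overview of the data"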
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"sns.set()\n",
"\n",
"import warnings\n",
"warnings.filterwarnings(\"ignore\", category=DeprecationWarning)\n",
"from sklearn.utils.testing import ignore_warnings\n",
"\n",
"def warn(*args, **kwargs):\n",
" pass\n",
"import warnings\n",
"warnings.warn = warn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df_train = pd.read_csv(\"titanic/train.csv\")\n",
"df_test = pd.read_csv(\"titanic/test.csv\")  # left as an exercise for you to analyse"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"df_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- PassengerId => passenger ID\n",
"- Survived => whether the passenger survived\n",
"- Pclass => passenger class (1st/2nd/3rd)\n",
"- Name => passenger name\n",
"- Sex => sex\n",
"- Age => age\n",
"- SibSp => number of siblings and spouses aboard\n",
"- Parch => number of parents and children aboard\n",
"- Ticket => ticket information\n",
"- Fare => fare\n",
"- Cabin => cabin\n",
"- Embarked => port of embarkation"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_train.info()"
]
},
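{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick look at the data shows that Age has many missing values, Cabin (cabin number) is mostly missing, and the other attributes have only a few missing values."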
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig, ax = plt.subplots(figsize=(9,5))\n",
"sns.heatmap(df_train.isnull(), cbar=False, cmap=\"YlGnBu_r\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# These are the categorical columns\n",
"cols = ['Survived', 'Sex', 'Pclass', 'SibSp', 'Parch', 'Embarked']"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nr_rows = 2\n",
"nr_cols = 3\n",
"\n",
"fig, axs = plt.subplots(nr_rows, nr_cols, figsize=(nr_cols*3.5,nr_rows*3))\n",
"\n",
"for r in range(0,nr_rows):\n",
" for c in range(0,nr_cols): \n",
" \n",
" i = r*nr_cols+c \n",
" ax = axs[r][c]\n",
" sns.countplot(df_train[cols[i]], hue=df_train[\"Survived\"], ax=ax)\n",
" ax.set_title(cols[i])\n",
" ax.legend() \n",
" \n",
"plt.tight_layout() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 认识数据\n",
"\n",
"- 第一张图:?\n",
"- 第二张图:?\n",
"- 第三张图:?\n",
"- 第四,五张图:?\n",
"- 第六张图: ?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 看看年龄的因素 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bins = np.arange(0, 80, 5)\n",
"g = sns.FacetGrid(df_train, row='Sex', col='Pclass', hue='Survived', margin_titles=True, size=3, aspect=1.1)\n",
"g.map(sns.distplot, 'Age', kde=False, bins=bins, hist_kws=dict(alpha=0.6))\n",
"g.add_legend() \n",
"plt.show() "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 分析一下"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 看看你票价因素 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_train['Fare'].max()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"bins = np.arange(0, 550, 20)\n",
"g = sns.FacetGrid(df_train, row='Sex', col='Pclass', hue='Survived', margin_titles=True, size=3, aspect=1.1)\n",
"g.map(sns.distplot, 'Fare', kde=False, bins=bins, hist_kws=dict(alpha=0.6))\n",
"g.add_legend() \n",
"plt.show() "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 仓位因素"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x='Pclass', y='Survived', data=df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of Pclass\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x='Sex', y='Survived', hue='Pclass', data=df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of Pclass and Sex\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 登船口因素"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.barplot(x='Embarked', y='Survived', data=df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of Embarked Port\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.boxplot(x='Embarked', y='Fare', data=df_train)\n",
"plt.title(\"Fare distribution as function of Embarked Port\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 增加一些新维度\n",
"\n",
"### 家庭大小,单独,名字长度,称呼"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"for df in [df_train, df_test] :\n",
" \n",
" df['FamilySize'] = df['SibSp'] + df['Parch']\n",
" \n",
" df['Alone']=0\n",
" df.loc[(df.FamilySize==0),'Alone'] = 1\n",
" \n",
" df['NameLen'] = df.Name.apply(lambda x : len(x)) \n",
" df['NameLenBin']=np.nan\n",
" for i in range(20,0,-1):\n",
" df.loc[ df['NameLen'] <= i*5, 'NameLenBin'] = i\n",
" \n",
" \n",
" df['Title']=0\n",
" df['Title']=df.Name.str.extract(r'([A-Za-z]+)\\.') #lets extract the Salutations\n",
" df['Title'].replace(['Mlle','Mme','Ms','Dr','Major','Lady','Countess','Jonkheer','Col','Rev','Capt','Sir','Don'],\n",
" ['Miss','Miss','Miss','Mr','Mr','Mrs','Mrs','Other','Other','Other','Mr','Mr','Mr'],inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.subplots(figsize=(10,6))\n",
"sns.barplot(x='NameLenBin' , y='Survived' , data = df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of NameLenBin\")\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 结论??"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"g = sns.factorplot(x=\"NameLenBin\", y=\"Survived\", col=\"Sex\", data=df_train, kind=\"bar\", size=5, aspect=1.2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 结论??\n",
"\n",
"### 称呼因素"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.subplots(figsize=(10,6))\n",
"sns.barplot(x='Title' , y='Survived' , data = df_train)\n",
"plt.ylabel(\"Survival Rate\")\n",
"plt.title(\"Survival as function of Title\")\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pd.crosstab(df_train.FamilySize,df_train.Survived).apply(lambda r: r/r.sum(), axis=1).style.background_gradient(cmap='summer_r')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 结论??"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 数据清洗\n",
"\n",
"### 第一步填充缺失数据"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 根据称呼补充他们的性别\n",
"df_train['Title'] = df_train['Title'].fillna(df_train['Title'].mode().iloc[0])\n",
"\n",
"# 年龄使用平均值填充\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Mr'),'Age']= df_train.Age[df_train.Title==\"Mr\"].mean()\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Mrs'),'Age']= df_train.Age[df_train.Title==\"Mrs\"].mean()\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Master'),'Age']= df_train.Age[df_train.Title==\"Master\"].mean()\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Miss'),'Age']= df_train.Age[df_train.Title==\"Miss\"].mean()\n",
"df_train.loc[(df_train.Age.isnull())&(df_train.Title=='Other'),'Age']= df_train.Age[df_train.Title==\"Other\"].mean()\n",
"df_train = df_train.drop('Name', axis=1)"
]
},
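{
"cell_type": "markdown",
"metadata": {},
"source": [
"The per-title fill above repeats the same line five times. As a hedged alternative, the sketch below does the same kind of fill in one group-wise step. The `toy` DataFrame and its values are made up for illustration and are not taken from the Titanic data. `groupby('Title')['Age'].transform('mean')` returns, for every row, the mean age of that row's title group, and `fillna` uses it only where Age is missing."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"# Hypothetical toy data, only to illustrate the technique\n",
"toy = pd.DataFrame({\n",
"    'Title': ['Mr', 'Mr', 'Mr', 'Miss', 'Miss'],\n",
"    'Age': [30.0, np.nan, 40.0, 20.0, np.nan],\n",
"})\n",
"\n",
"# Mean age of each title group, aligned row by row with toy\n",
"group_mean = toy.groupby('Title')['Age'].transform('mean')\n",
"# Only the missing ages are replaced\n",
"toy['Age'] = toy['Age'].fillna(group_mean)\n",
"print(toy)"
]
},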
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 设置登船口默认值是第一个\n",
"df_train['Embarked'] = df_train['Embarked'].fillna(df_train['Embarked'].mode().iloc[0])\n",
"# 票价用平均值填充\n",
"df_train['Fare'] = df_train['Fare'].fillna(df_train['Fare'].mean())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 年龄按10年分段,票价按50分段,方便查找规律\n",
"df = df_train\n",
"df['Age_bin']=np.nan\n",
"for i in range(8,0,-1):\n",
" df.loc[ df['Age'] <= i*10, 'Age_bin'] = i\n",
"\n",
"df['Fare_bin']=np.nan\n",
"for i in range(12,0,-1):\n",
" df.loc[ df['Fare'] <= i*50, 'Fare_bin'] = i \n",
"\n",
"# 把文字变成数字,让计算机更好处理\n",
"df['Title'] = df['Title'].map( {'Other':0, 'Mr': 1, 'Master':2, 'Miss': 3, 'Mrs': 4 } )\n",
"# 如果称呼为空,填充第一个\n",
"df['Title'] = df['Title'].fillna(df['Title'].mode().iloc[0])\n",
"df['Title'] = df['Title'].astype(int) "
]
},
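{
"cell_type": "markdown",
"metadata": {},
"source": [
"The binning loops above can also be expressed with `pd.cut`. This is a minimal sketch, assuming non-negative ages; the sample values in `ages` are made up. Bin i covers the interval (10*(i-1), 10*i], the same as the loop, and values above 80 fall outside every bin and become NaN."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"ages = pd.Series([0.5, 10, 10.1, 35, 80, 81])  # made-up sample values\n",
"edges = list(range(0, 90, 10))                  # bin edges 0, 10, ..., 80\n",
"labels = list(range(1, 9))                      # one label per bin: 1..8\n",
"\n",
"# include_lowest=True puts an age of exactly 0 into the first bin\n",
"age_bin = pd.cut(ages, bins=edges, labels=labels, include_lowest=True)\n",
"print(age_bin.tolist())"
]
},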
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 复制一份数据,保护原始数据\n",
"df_train_ml = df_train.copy()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_train_ml.info()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 把类别参数做成新的列,用0-1表示对应项\n",
"df_train_ml = pd.get_dummies(df_train_ml, columns=['Sex', 'Embarked', 'Pclass'], drop_first=True)\n",
"df_train_ml.drop(['PassengerId','Ticket','Cabin','Age', 'Fare_bin'],axis=1,inplace=True)\n",
"df_train_ml.dropna(inplace=True)\n",
"df_train_ml.drop(['NameLen'], axis=1, inplace=True)\n",
"df_train_ml.drop(['SibSp'], axis=1, inplace=True)\n",
"df_train_ml.drop(['Parch'], axis=1, inplace=True)\n",
"df_train_ml.drop(['Alone'], axis=1, inplace=True)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_train_ml.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 下一步就是机器学习,有兴趣同学可以看 sklearn\n",
"\n",
"官网:https://scikit-learn.org/stable/"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 进程和线程\n",
"\n",
"什么叫“多任务”呢?简单地说,就是操作系统可以同时运行多个任务。打个比方,你一边在用浏览器上网,一边在听MP3,一边在用Word赶作业,这就是多任务,至少同时有3个任务正在运行。还有很多任务悄悄地在后台同时运行着,只是桌面上没有显示而已。\n",
"\n",
"对于操作系统来说,一个任务就是一个进程(Process),比如打开一个浏览器就是启动一个浏览器进程\n",
"\n",
"有些进程还不止同时干一件事,比如Word,它可以同时进行打字、拼写检查、打印等事情。在一个进程内部,要同时干多件事,就需要同时运行多个“子任务”,我们把进程内的这些“子任务”称为线程(Thread)。\n",
"\n",
"由于每个进程至少要干一件事,所以,一个进程至少有一个线程。当然,像Word这种复杂的进程可以有多个线程,多个线程可以同时执行\n",
"\n",
"我们前面编写的所有的Python程序,都是执行单任务的进程,也就是只有一个线程。如果我们要同时执行多个任务怎么办?\n",
"\n",
"\n",
"### 有两种解决方案:\n",
"\n",
"一种是启动多个进程,每个进程虽然只有一个线程,但多个进程可以一块执行多个任务。\n",
"\n",
"还有一种方法是启动一个进程,在一个进程内启动多个线程,这样,多个线程也可以一块执行多个任务。\n",
"\n",
"同时执行多个任务通常各个任务之间并不是没有关联的,而是需要相互通信和协调,有时,任务1必须暂停等待任务2完成后才能继续执行,有时,任务3和任务4又不能同时执行,所以,多进程和多线程的程序的复杂度要远远高于我们前面写的单进程单线程的程序。\n",
"\n",
"多进程和多线程的程序涉及到同步、数据共享的问题,编写起来更复杂。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 多进程\n",
"\n",
"要让Python程序实现多进程(multiprocessing),我们先了解操作系统的相关知识。\n",
"\n",
"Unix/Linux操作系统提供了一个fork()系统调用,它非常特殊。普通的函数调用,调用一次,返回一次,但是fork()调用一次,返回两次,因为操作系统自动把当前进程(称为父进程)复制了一份(称为子进程),然后,分别在父进程和子进程内返回。\n",
"\n",
"子进程永远返回0,而父进程返回子进程的ID。这样做的理由是,一个父进程可以fork出很多子进程,所以,父进程要记下每个子进程的ID,而子进程只需要调用getppid()就可以拿到父进程的ID。\n",
"\n",
"Python的os模块封装了常见的系统调用,其中就包括fork,可以在Python程序中轻松创建子进程:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import os\n",
"\n",
"print('进程 (%s) 启动...' % os.getpid())\n",
"# Only works on Unix/Linux/Mac:\n",
"pid = os.fork()\n",
"\n",
"if pid == 0:\n",
" print('我是子进程(%s) ,我的父进程是(%s).' % (os.getpid(), os.getppid()))\n",
"else:\n",
" print('我 (%s) 创造了一个子进程 (%s).' % (os.getpid(), pid))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 由于Windows没有fork调用,上面的代码在Windows上无法运行"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### multiprocessing\n",
"如果你打算编写多进程的服务程序,Unix/Linux无疑是正确的选择。由于Windows没有fork调用,难道在Windows上无法用Python编写多进程的程序?\n",
"\n",
"由于Python是跨平台的,自然也应该提供一个跨平台的多进程支持。multiprocessing模块就是跨平台版本的多进程模块。\n",
"\n",
"multiprocessing模块提供了一个Process类来代表一个进程对象,下面的例子演示了启动一个子进程并等待其结束:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from multiprocessing import Process\n",
"import os\n",
"import time"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 子进程要执行的代码\n",
"def run_proc(name):\n",
" time.sleep(10)\n",
" print('运行子进程 %s (%s)...' % (name, os.getpid()))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"p = Process(target=run_proc, args=('test',))\n",
"print('开启父进程.')\n",
"p.start()\n",
"print('父进程 %s.' % os.getpid())\n",
"p.join()\n",
"print('子进程结束.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"创建子进程时,只需要传入一个执行函数和函数的参数,创建一个Process实例,用start()方法启动,这样创建进程比fork()还要简单。\n",
"\n",
"join()方法可以等待子进程结束后再继续往下运行,通常用于进程间的同步。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 进程间通信\n",
"\n",
"Process之间肯定是需要通信的,操作系统提供了很多机制来实现进程间的通信。Python的multiprocessing模块包装了底层的机制,提供了Queue、Pipes等多种方式来交换数据。\n",
"\n",
"我们以Queue为例,在父进程中创建两个子进程,一个往Queue里写数据,一个从Queue里读数据:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from multiprocessing import Process, Queue\n",
"import os, time, random\n",
"\n",
"# 写数据进程执行的代码:\n",
"def write(q):\n",
" print('Process to write: %s' % os.getpid())\n",
" for value in ['A', 'B', 'C']:\n",
" print('Put %s to queue...' % value)\n",
" q.put(value)\n",
" time.sleep(random.random())\n",
"\n",
"# 读数据进程执行的代码:\n",
"def read(q):\n",
" print('Process to read: %s' % os.getpid())\n",
" while True:\n",
" value = q.get(True)\n",
" print('Get %s from queue.' % value)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 父进程创建Queue,并传给各个子进程:\n",
"q = Queue()\n",
"pw = Process(target=write, args=(q,))\n",
"pr = Process(target=read, args=(q,))\n",
"# 启动子进程pw,写入:\n",
"pw.start()\n",
"# 启动子进程pr,读取:\n",
"pr.start()\n",
"# 等待pw结束:\n",
"pw.join()\n",
"# pr进程里是死循环,无法等待其结束,只能强行终止:\n",
"pr.terminate()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 多线程\n",
"\n",
"多任务可以由多进程完成,也可以由一个进程内的多线程完成。\n",
"\n",
"我们前面提到了进程是由若干线程组成的,一个进程至少有一个线程。\n",
"\n",
"由于线程是操作系统直接支持的执行单元,因此,高级语言通常都内置多线程的支持,Python也不例外,并且,Python的线程是真正的Posix Thread,而不是模拟出来的线程。\n",
"\n",
"Python的标准库提供了两个模块:_thread和threading,_thread是低级模块,threading是高级模块,对_thread进行了封装。绝大多数情况下,我们只需要使用threading这个高级模块。\n",
"\n",
"启动一个线程就是把一个函数传入并创建Thread实例,然后调用start()开始执行:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import time, threading, random"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 新线程执行的代码:\n",
"def loop():\n",
" print('线程 %s 运行中...' % threading.current_thread().name)\n",
" n = 0\n",
" while n < 5:\n",
" n = n + 1\n",
" print('线程 %s >>> %s' % (threading.current_thread().name, n))\n",
" time.sleep(1)\n",
" print('线程 %s 结束.' % threading.current_thread().name)\n",
"\n",
"print('线程 %s 在运行...' % threading.current_thread().name)\n",
"t = threading.Thread(target=loop, name='LoopThread')\n",
"t.start()\n",
"t.join()\n",
"print('线程 %s 结束.' % threading.current_thread().name)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"由于任何进程默认就会启动一个线程,我们把该线程称为主线程,主线程又可以启动新的线程,Python的threading模块有个current_thread()函数,它永远返回当前线程的实例。主线程实例的名字叫MainThread,子线程的名字在创建时指定,我们用LoopThread命名子线程。名字仅仅在打印时用来显示,完全没有其他意义,如果不起名字Python就自动给线程命名为Thread-1,Thread-2……\n"
]
},
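{
"cell_type": "markdown",
"metadata": {},
"source": [
"The introduction noted that threads sharing data need synchronization. The sketch below illustrates that idea with a shared counter; the names `counter`, `lock` and `add_many` are made up for illustration. Each thread acquires a `threading.Lock` before updating the counter, so the read-modify-write steps of different threads cannot interleave."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import threading\n",
"\n",
"counter = 0\n",
"lock = threading.Lock()\n",
"\n",
"def add_many(times):\n",
"    global counter\n",
"    for _ in range(times):\n",
"        # Only one thread at a time can hold the lock\n",
"        with lock:\n",
"            counter += 1\n",
"\n",
"threads = [threading.Thread(target=add_many, args=(10000,)) for _ in range(4)]\n",
"for th in threads:\n",
"    th.start()\n",
"for th in threads:\n",
"    th.join()\n",
"print(counter)  # 40000: 4 threads x 10000 increments"
]
},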
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 应用例子\n",
"\n",
"多线程爬虫,用多线程提高爬虫效率,爬虫花费大量时间在等待网页的回复,CPU利用率不高,所以我们可以同时打开多个网页,提高效率!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import threading\n",
"import requests\n",
"import time,random\n",
"\n",
"\n",
"def func1(url):\n",
" print('打开网页%s, 模拟爬虫工作' % url)\n",
" res = requests.get(url)\n",
" time.sleep(random.randint(2,30))\n",
" print('结束,%s 返回结果 %s' % (url, res.status_code))\n",
"\n",
"def func2(urlinfo):\n",
" for i in urlinfo:\n",
" th = threading.Thread(target=func1,args=[i])\n",
" th.start()\n",
" print('主程序结束')\n",
"\n",
"\n",
"urlinfo = ['http://www.sohu.com', 'http://www.163.com', 'http://www.sina.com']\n",
"func2(urlinfo)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 练习 输出 0到11,观察他们的输出顺序\n",
"import threading\n",
"num = 0\n",
"\n",
"\n",
"def t():\n",
" global ???\n",
" num += 1\n",
" print(num)\n",
"\n",
"for i in range(0, 11):\n",
" d = threading.Thread(???)\n",
" d.???"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 正则表达式\n",
"\n",
"字符串是编程时涉及到的最多的一种数据结构,对字符串进行操作的需求几乎无处不在。比如判断一个字符串是否是合法的Email地址,虽然可以编程提取@前后的子串,再分别判断是否是单词和域名,但这样做不但麻烦,而且代码难以复用。\n",
"\n",
"正则表达式是一种用来匹配字符串的强有力的武器。它的设计思想是用一种描述性的语言来给字符串定义一个规则,凡是符合规则的字符串,我们就认为它“匹配”了,否则,该字符串就是不合法的。\n",
"\n",
"所以我们判断一个字符串是否是合法的Email的方法是:\n",
"\n",
"创建一个匹配Email的正则表达式;\n",
"\n",
"用该正则表达式去匹配用户的输入来判断是否合法。\n",
"\n",
"因为正则表达式也是用字符串表示的,所以,我们要首先了解如何用字符来描述字符。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### re.match函数\n",
"\n",
"函数语法:\n",
"\n",
"```re.match(pattern, string, flags=0)```\n",
"\n",
"pattern\t匹配的正则表达式\n",
"\n",
"string\t要匹配的字符串。\n",
"\n",
"flags\t标志位,用于控制正则表达式的匹配方式,如:是否区分大小写,多行匹配等等。参见:正则表达式修饰符 - 可选标志"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import re"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#在起始位置匹配\n",
"print(re.match('www', 'www.163.com').span())\n",
"print(re.match('163', 'www.163.com'))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"line = \"Cats Are smarter than dogs\"\n",
"\n",
"matchObj = re.match( r'(.*) are (.*?) .*', line, re.I)\n",
" \n",
"if matchObj:\n",
" print(\"matchObj.group() : \", matchObj.group())\n",
" print(\"matchObj.group(1) : \", matchObj.group(1))\n",
" print(\"matchObj.group(2) : \", matchObj.group(2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"1. re.I\t 使匹配对大小写不敏感\n",
"2. re.L\t 做本地化识别(locale-aware)匹配\n",
"3. re.M\t 多行匹配,影响 ^ 和 $\n",
"4. re.S\t 使 . 匹配包括换行在内的所有字符\n",
"5. re.U\t 根据Unicode字符集解析字符。这个标志影响 \\w, \\W, \\b, \\B.\n",
"6. re.X\t 该标志通过给予你更灵活的格式以便你将正则表达式写得更易于理解。"
]
},
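{
"cell_type": "markdown",
"metadata": {},
"source": [
"To see how flags change the result, here is a short sketch for two of them, re.M and re.S. The sample text is made up for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"text = 'first line\\nsecond line'\n",
"\n",
"# Without re.M, ^ only matches at the very start of the string\n",
"print(re.findall(r'^\\w+', text))        # ['first']\n",
"# With re.M, ^ also matches right after each newline\n",
"print(re.findall(r'^\\w+', text, re.M))  # ['first', 'second']\n",
"\n",
"# Without re.S, . stops at the newline\n",
"print(re.match(r'.+', text).group())    # first line\n",
"# With re.S, . also matches the newline, so the whole text is matched\n",
"print(re.match(r'.+', text, re.S).group() == text)  # True"
]
},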
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### re.search方法\n",
"\n",
"re.search 扫描整个字符串并返回第一个成功的匹配。\n",
"\n",
"函数语法:\n",
"\n",
"```re.search(pattern, string, flags=0)```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print(re.search('www', 'www.163.com').span()) # 在起始位置匹配\n",
"print(re.search('163', 'www.163.com').span()) # 不在起始位置匹配"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"line = \"Cats Are smarter than dogs\"\n",
"\n",
"matchObj = re.search( r'(.*) are (.*?) .*', line, re.I)\n",
" \n",
"if matchObj:\n",
" print(\"matchObj.group() : \", matchObj.group())\n",
" print(\"matchObj.group(1) : \", matchObj.group(1))\n",
" print(\"matchObj.group(2) : \", matchObj.group(2))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### re.match与re.search的区别\n",
"re.match只匹配字符串的开始,如果字符串开始不符合正则表达式,则匹配失败,函数返回None;而re.search匹配整个字符串,直到找到一个匹配。"
]
},
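{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal side-by-side sketch of this difference, using a made-up sample string. The last line shows that anchoring a search with ^ makes it behave like match here."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"text = 'python regex'\n",
"\n",
"# match only tries at position 0, so a pattern that occurs later fails\n",
"print(re.match(r'regex', text))          # None\n",
"# search scans forward and reports the first occurrence\n",
"print(re.search(r'regex', text).span())  # (7, 12)\n",
"# Anchoring the search at the start gives the same answer as match\n",
"print(re.search(r'^regex', text))        # None"
]
},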
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 练习,找出“dogs”\n",
"line = \"Cats are smarter than dogs\";\n",
" \n",
"matchObj = re.???( r'dogs', line, re.I)\n",
"if matchObj:\n",
" print(\"match --> matchObj.group() : \", matchObj.group())\n",
"else:\n",
" print(\"No match!!\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 检索和替换\n",
"Python 的 re 模块提供了re.sub用于替换字符串中的匹配项。\n",
"\n",
"语法:\n",
"\n",
"```re.sub(pattern, repl, string, count=0, flags=0)```\n",
"\n",
"- pattern : 正则中的模式字符串。\n",
"- repl : 替换的字符串,也可为一个函数。\n",
"- string : 要被查找替换的原始字符串。\n",
"- count : 模式匹配后替换的最大次数,默认 0 表示替换所有的匹配。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"phone = \"0757-86547-1548 # 这是一个电话号码\"\n",
" \n",
"# 删除字符串中的 Python注释 \n",
"num = re.sub(r'#.*$', \"\", phone)\n",
"print(\"电话号码是: \", num)\n",
" \n",
"# 删除非数字(-)的字符串 \n",
"num = re.sub(r'\\D', \"\", phone)\n",
"#num = re.sub(r'-', \"\", num)\n",
"print(\"电话号码是 : \", num)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- \\d\t匹配一个Unicode数字,如果带re.ASCII,则匹配0-9\n",
"- \\D 匹配Unicode非数字\n",
"- \\s\t匹配Unicode空白,如果带有re.ASCII,则匹配\\t\\n\\r\\f\\v中的一个\n",
"- \\S 匹配Unicode非空白\n",
"- \\w\t匹配Unicode单词字符,如果带有re.ascii,则匹配[a-zA-Z0-9_]中的一个\n",
"- \\W 匹配Unicode非单子字符"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 将匹配的数字乘以 2\n",
"def double(matched):\n",
" value = int(matched.group('value'))\n",
" return str(value * 2)\n",
" \n",
"s = 'A23G4HFD423'\n",
"print(re.sub('(?P<value>\\d+)', double, s))"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"{'province': '110', 'city': '223', 'born_year': '1990', 'born_month': '03', 'born_date': '06'}\n"
]
}
],
"source": [
"# 分组匹配\n",
"import re\n",
"s = '110223199003060030'\n",
"res = re.search('(?P<province>\\d{3})(?P<city>\\d{3})(?P<born_year>\\d{4})(?P<born_month>\\d{2})(?P<born_date>\\d{2})',s)\n",
"print(res.groupdict())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 练习:不用正则表达方式实现同样功能"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### re.compile 函数\n",
"compile 函数用于编译正则表达式,生成一个正则表达式( Pattern )对象,供 match() 和 search() 这两个函数使用。\n",
"\n",
"语法格式为:\n",
"\n",
"```re.compile(pattern[, flags])```\n",
"\n",
"- pattern : 一个字符串形式的正则表达式\n",
"\n",
"- flags : 可选,表示匹配模式,比如忽略大小写,多行模式等,具体参数为:\n",
"\n",
"- re.I 忽略大小写\n",
"- re.L 表示特殊字符集 \\w, \\W, \\b, \\B, \\s, \\S 依赖于当前环境\n",
"- re.M 多行模式\n",
"- re.S 即为 . 并且包括换行符在内的任意字符(. 不包括换行符)\n",
"- re.U 表示特殊字符集 \\w, \\W, \\b, \\B, \\d, \\D, \\s, \\S 依赖于 Unicode 字符属性数据库\n",
"- re.X 为了增加可读性,忽略空格和 # 后面的注释"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pattern = re.compile(r'\\d+')\n",
"m = pattern.match('one12twothree34four')\n",
"print(m)\n",
"m = pattern.match('one12twothree34four', 3, 10)\n",
"print(m)\n",
"print(m.group(0))\n",
"print(m.start(0), m.end(0), m.span())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)\n",
"m = pattern.match('Hello World Wide Web')\n",
"print(m)\n",
"print(m.group(0)) # 返回匹配成功的整个子串\n",
"print(m.group(1)) # 返回第一个分组匹配成功的子串\n",
"print(m.group(2)) # 返回第二个分组匹配成功的子串\n",
"print(m.groups()) # 等价于 (m.group(1), m.group(2), ...)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### findall\n",
"在字符串中找到正则表达式所匹配的所有子串,并返回一个列表,如果没有找到匹配的,则返回空列表。\n",
"\n",
"注意: match 和 search 是匹配一次 findall 匹配所有。\n",
"\n",
"语法格式为:\n",
"\n",
"```findall(string[, pos[, endpos]])```\n",
"\n",
"参数:\n",
"\n",
"- string : 待匹配的字符串。\n",
"- pos : 可选参数,指定字符串的起始位置,默认为 0。\n",
"- endpos : 可选参数,指定字符串的结束位置,默认为字符串的长度。"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"pattern = re.compile(r'\\d+')\n",
"m = pattern.findall('one12twothree34four')\n",
"print(m)"
]
},
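{
"cell_type": "markdown",
"metadata": {},
"source": [
"The pos and endpos parameters are available on the method of a compiled pattern. A small sketch with the same sample string, limiting the search to part of it:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"pattern = re.compile(r'\\d+')\n",
"s = 'one12twothree34four'\n",
"\n",
"# Search only the first 10 characters, 'one12twoth'\n",
"print(pattern.findall(s, 0, 10))  # ['12']\n",
"# Start at index 4, in the middle of '12'\n",
"print(pattern.findall(s, 4))      # ['2', '34']"
]
},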
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### re.split\n",
"split 方法按照能够匹配的子串将字符串分割后返回列表,它的使用形式如下:\n",
"\n",
"```re.split(pattern, string[, maxsplit=0, flags=0])```"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"res = re.split('\\W+', 'abx, 123sd, good.')\n",
"print(res)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 练习不用正则表达,实现相同功能\n",
"s = 'sd1xxx2aa2a3sd3xx12yy'\n",
"res = re.split('\\d+', s)\n",
"print(res)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 正则表达式模式\n",
"模式字符串使用特殊的语法来表示一个正则表达式:\n",
"\n",
"字母和数字表示他们自身。一个正则表达式模式中的字母和数字匹配同样的字符串。\n",
"\n",
"多数字母和数字前加一个反斜杠时会拥有不同的含义。\n",
"\n",
"标点符号只有被转义时才匹配自身,否则它们表示特殊的含义。\n",
"\n",
"反斜杠本身需要使用反斜杠转义。\n",
"\n",
"由于正则表达式通常都包含反斜杠,所以你最好使用原始字符串来表示它们。模式元素(如 r'\\t',等价于 '\\\\t')匹配相应的特殊字符。\n",
"\n",
"下表列出了正则表达式模式语法中的特殊元素。如果你使用模式的同时提供了可选的标志参数,某些模式元素的含义会改变。"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 基本符号\n",
"\n",
"- ^ 表示匹配字符串的开始位置 (例外 用在中括号中[ ] 时,可以理解为取反,表示不匹配括号中字符串)\n",
"- $ 表示匹配字符串的结束位置\n",
"- * 表示匹配 零次到多次\n",
"- + 表示匹配 一次到多次 (至少有一次)\n",
"- ? 表示匹配零次或一次\n",
"- . 表示匹配单个字符 \n",
"- | 表示为或者,两项中取一项\n",
"- ( ) 小括号表示匹配括号中全部字符\n",
"- [ ] 中括号表示匹配括号中一个字符 范围描述 如[0-9 a-z A-Z]\n",
"- { } 大括号用于限定匹配次数 如 {n}表示匹配n个字符 {n,}表示至少匹配n个字符 {n,m}表示至少n,最多m\n",
"- \\ 转义字符 如上基本符号匹配都需要转义字符 如 \\* 表示匹配*号\n",
"- \\w 表示英文字母和数字 \\W 非字母和数字\n",
"- \\d 表示数字 \\D 非数字\n",
"----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------\n"
]
},
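{
"cell_type": "markdown",
"metadata": {},
"source": [
"A few of the symbols above in action. This is a small sketch with made-up sample strings; it also uses `re.fullmatch`, which is not covered elsewhere in this chapter and succeeds only if the whole string matches the pattern."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re\n",
"\n",
"# ? makes the preceding character optional\n",
"print(bool(re.fullmatch(r'colou?r', 'color')))         # True\n",
"# {n,m} limits the number of repetitions: five digits are too many\n",
"print(bool(re.fullmatch(r'\\d{3,4}', '12345')))         # False\n",
"# [^ ] matches one character that is not in the set\n",
"print(re.findall(r'[^0-9]+', 'ab12cd'))                # ['ab', 'cd']\n",
"# | chooses between alternatives, ( ) groups them\n",
"print(re.search(r'(cat|dog)s', 'hot dogs').group())    # dogs\n",
"# \\ escapes a special character so that it matches literally\n",
"print(re.findall(r'\\*', '2*3*4'))                      # ['*', '*']"
]
},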
{
"cell_type": "markdown",
"metadata": {},
"source": [
"常用的正则表达式(转)\n",
"匹配中文字符的正则表达式: [\\u4e00-\\u9fa5]\n",
"\n",
"匹配双字节字符(包括汉字在内):[^\\x00-\\xff]\n",
"\n",
"匹配空行的正则表达式:\\n[\\s| ]*\\r\n",
"\n",
"匹配HTML标记的正则表达式:/<(.*)>.*<\\/\\1>|<(.*) \\/>/ \n",
"\n",
"匹配首尾空格的正则表达式:(^\\s*)|(\\s*$)\n",
"\n",
"匹配IP地址的正则表达式:/(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)/g //\n",
"\n",
"匹配Email地址的正则表达式:\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*\n",
"\n",
"匹配网址URL的正则表达式:http://(/[\\w-]+\\.)+[\\w-]+(/[\\w- ./?%&=]*)?\n",
"\n",
"sql语句:^(select|drop|delete|create|update|insert).*$\n",
"\n",
"1、非负整数:^\\d+$ \n",
"\n",
"2、正整数:^[0-9]*[1-9][0-9]*$ \n",
"\n",
"3、非正整数:^((-\\d+)|(0+))$ \n",
"\n",
"4、负整数:^-[0-9]*[1-9][0-9]*$ \n",
"\n",
"5、整数:^-?\\d+$ \n",
"\n",
"6、非负浮点数:^\\d+(\\.\\d+)?$ \n",
"\n",
"7、正浮点数:^((0-9)+\\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\\.[0-9]+)|([0-9]*[1-9][0-9]*))$ \n",
"\n",
"8、非正浮点数:^((-\\d+\\.\\d+)?)|(0+(\\.0+)?))$ \n",
"\n",
"9、负浮点数:^(-((正浮点数正则式)))$ \n",
"\n",
"10、英文字符串:^[A-Za-z]+$ \n",
"\n",
"11、英文大写串:^[A-Z]+$ \n",
"\n",
"12、英文小写串:^[a-z]+$ \n",
"\n",
"13、英文字符数字串:^[A-Za-z0-9]+$ \n",
"\n",
"14、英数字加下划线串:^\\w+$ \n",
"\n",
"15、E-mail地址:^[\\w-]+(\\.[\\w-]+)*@[\\w-]+(\\.[\\w-]+)+$ \n",
"\n",
"16、URL:^[a-zA-Z]+://(\\w+(-\\w+)*)(\\.(\\w+(-\\w+)*))*(\\?\\s*)?$ \n",
"或:^http:\\/\\/[A-Za-z0-9]+\\.[A-Za-z0-9]+[\\/=\\?%\\-&_~`@[\\]\\':+!]*([^<>\\\"\\\"])*$\n",
"\n",
"17、邮政编码:^[1-9]\\d{5}$\n",
"\n",
"18、中文:^[\\u0391-\\uFFE5]+$\n",
"\n",
"19、电话号码:^((\\d2,3)|(\\d{3}\\-))?(0\\d2,3|0\\d{2,3}-)?[1-9]\\d{6,7}(\\-\\d{1,4})?$\n",
"\n",
"20、手机号码:^((\\d2,3)|(\\d{3}\\-))?13\\d{9}$\n",
"\n",
"21、双字节字符(包括汉字在内):^\\x00-\\xff\n",
"\n",
"22、匹配首尾空格:(^\\s*)|(\\s*$)(像vbscript那样的trim函数)\n",
"\n",
"23、匹配HTML标记:<(.*)>.*<\\/\\1>|<(.*) \\/> \n",
"\n",
"24、匹配空行:\\n[\\s| ]*\\r\n",
"\n",
"25、提取信息中的网络链接:(h|H)(r|R)(e|E)(f|F) *= *('|\")?(\\w|\\\\|\\/|\\.)+('|\"| *|>)?\n",
"\n",
"26、提取信息中的邮件地址:\\w+([-+.]\\w+)*@\\w+([-.]\\w+)*\\.\\w+([-.]\\w+)*\n",
"\n",
"27、提取信息中的图片链接:(s|S)(r|R)(c|C) *= *('|\")?(\\w|\\\\|\\/|\\.)+('|\"| *|>)?\n",
"\n",
"28、提取信息中的IP地址:(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)\n",
"\n",
"29、提取信息中的中国手机号码:(86)*0*13\\d{9}\n",
"\n",
"30、提取信息中的中国固定电话号码:(\\d3,4|\\d{3,4}-|\\s)?\\d{8}\n",
"\n",
"31、提取信息中的中国电话号码(包括移动和固定电话):(\\d3,4|\\d{3,4}-|\\s)?\\d{7,14}\n",
"\n",
"32、提取信息中的中国邮政编码:[1-9]{1}(\\d+){5}\n",
"\n",
"33、提取信息中的浮点数(即小数):(-?\\d*)\\.?\\d+\n",
"\n",
"34、提取信息中的任何数字 :(-?\\d*)(\\.\\d+)? \n",
"\n",
"35、IP:(\\d+)\\.(\\d+)\\.(\\d+)\\.(\\d+)\n",
"\n",
"36、电话区号:/^0\\d{2,3}$/\n",
"\n",
"37、腾讯QQ号:^[1-9]*[1-9][0-9]*$\n",
"\n",
"38、帐号(字母开头,允许5-16字节,允许字母数字下划线):^[a-zA-Z][a-zA-Z0-9_]{4,15}$\n",
"\n",
"39、中文、英文、数字及下划线:^[\\u4e00-\\u9fa5_a-zA-Z0-9]+$"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Requests\n",
"\n",
"它是一个Python第三方库,处理URL资源特别方便\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import requests # 如果这里出错,证明你还没有安装这个库"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"r = requests.get('https://www.toutiao.com/') # 今日头条\n",
"print(\"查看返回状态\", r.status_code) # 200代表成功 ,404, 403, 501这些意思可以百度查一下"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 查看一下内容\n",
"\"\"\"\n",
"print(r.text) # 返回正常的网页内容, 即解压解码之后的内容\n",
"print(r.content) # 返回byte类型的网页内容, 即值解压, 没有解码\n",
"print(r.json()) # 如果网页内容为json, 直接返回一个json对象\n",
"print(r.encoding) # 返回网页的编码: \"utf-8\"\n",
"\"\"\"\n",
"r.content"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 网页表头信息\n",
"r.headers"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from urllib.parse import urlencode\n",
"# 获取一些有意思的内容\n",
"def get_page(offset):\n",
" params = {\n",
" 'offset': offset,\n",
" 'format': 'json',\n",
" 'keyword': '搞笑',\n",
" 'autoload': 'true',\n",
" 'count': '20',\n",
" 'cur_tab': '3',\n",
" 'from': 'gallery',\n",
" }\n",
" url = 'https://www.toutiao.com/search_content/?' + urlencode(params)\n",
" try:\n",
" print(url)\n",
" response = requests.get(url)\n",
" if response.status_code == 200:\n",
" return response.json()\n",
" except requests.ConnectionError:\n",
" return None"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"contents = get_page(1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(contents)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 分析一下结构\n",
"data = contents.get('data')\n",
"all_images = {}\n",
"if data:\n",
" for item in data:\n",
" # print(item)\n",
" image_list = item.get('image_list')\n",
" title = item.get('title')\n",
" item_id = item.get('id')\n",
" # print(image_list)\n",
" imgs = []\n",
" for image in image_list:\n",
" imgs.append(image.get('url')[2:])\n",
" \n",
" all_images[item_id] = {\n",
" 'title': title,\n",
" 'images': imgs\n",
" }\n",
"print(all_images)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 练习:保存图片 提示: os.path, 字符串处理(+http, 替换list->large, 文档操作)\n",
"# 建议使用Pycharm来写"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 补充知识\n",
"\n",
"# 不同方式获取网页内容, 返回一个Response对象, 请求的参数可以为url或Request对象\n",
"r0 = requests.get(\"https://github.com/timeline.json\")\n",
"r1 = requests.post(\"http://httpbin.org/post\")\n",
"r2 = requests.put(\"http://httpbin.org/put\")\n",
"r3 = requests.delete(\"http://httpbin.org/delete\")\n",
"r4 = requests.head(\"http://httpbin.org/get\")\n",
"r5 = requests.options(\"http://httpbin.org/get\")\n",
"r6 = requests.patch(\"http://httpbin.org/get\")\n",
"\n",
"# 定制请求头: 一个字典\n",
"headers = {\"user-agent\": \"my-app/0.0.1\"}\n",
"r = requests.get(\"https://api.github.com/some/endpoint\", headers=headers)\n",
"print(r.request.headers) # 获取request的头部\n",
"print(r.headers) # 获取response的头部\n",
"\n",
"# 模拟一个手机的UA\n",
"# Mozilla/5.0 (Linux; Android 8.1.0; ALP-AL00 Build/HUAWEIALP-AL00; wv) \n",
"# AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/63.0.3239.83 \n",
"# Mobile Safari/537.36 T7/10.13 baiduboxapp/10.13.0.11 (Baidu; P1 8.1.0)\n",
"\n",
"# {\n",
"# \"content-encoding\": \"gzip\",\n",
"# \"transfer-encoding\": \"chunked\",\n",
"# \"connection\": \"close\",\n",
"# \"server\": \"nginx/1.0.4\",\n",
"# \"x-runtime\": \"148ms\",\n",
"# \"etag\": \"e1ca502697e5c9317743dc078f67693f\",\n",
"# \"content-type\": \"application/json\"\n",
"# }\n",
"print(r.headers[\"Content-Type\"]) # \"application/json\"\n",
"print(r.headers.get(\"content-type\")) # \"application/json\"\n",
"\n",
"# 更加复杂的POST请求: 表单\n",
"post_dict = {\"key1\": \"value1\", \"key2\": \"value2\"}\n",
"r = requests.post(\"http://httpbin.org/post\", data=post_dict)\n",
"print(r.text)\n",
"\n",
"# 要想发送你的cookies到服务器, 可以使用cookies参数(一个字典)\n",
"cookies = {\"cookies_are\": \"working\"}\n",
"r = requests.get(\"http://httpbin.org/cookies\", cookies=cookies)\n",
"print(r.text)\n",
"\n",
"# 会话对象: 会话对象让你能够跨请求保持某些参数, 它也会在同一个Session实例发出的所有请求之间保持cookie\n",
"s = requests.Session()\n",
"s.get(\"http://httpbin.org/cookies/set/sessioncookie/123456789\")\n",
"s.get(\"http://httpbin.org/cookies\")\n",
"for cookie in s.cookies:\n",
" print(cookie)\n",
"\n",
"# 如果你要手动为会话添加cookie, 就是用Cookie utility函数来操纵Session.cookies\n",
"requests.utils.add_dict_to_cookiejar(s.cookies, {\"cookie_key\": \"cookie_value\"})\n",
"\n",
"# 会话也可用来为请求方法提供缺省数据, 这是通过为会话对象的属性提供数据来实现的\n",
"s.auth = (\"user\", \"pass\")\n",
"s.headers.update({\"x-test\": \"true\"})\n",
"s.get(\"http://httpbin.org/headers\", headers={\"x-test2\": \"true\"})\n",
"# both \"x-test\" and \"x-test2\" are sent"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}