Commit c1d00d6d

Remove `AS::Multibyte`'s unicode table

Authored Oct 09, 2016 by Fumiaki MATSUSHIMA
Parent: 0f05c87e

Showing 4 changed files with 15 additions and 439 deletions (+15 -439)
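The removal is possible because Ruby (2.2+) ships `String#unicode_normalize`, which covers the normalization work the bundled Marshal table previously drove; a minimal sketch (illustrative string only, not part of the commit):

```ruby
# Ruby's built-in normalization replaces the Marshal-loaded unicode table.
s = "e\u0301"                      # "é" as base letter + combining acute accent
nfc = s.unicode_normalize(:nfc)    # composed form, a single codepoint
nfd = nfc.unicode_normalize(:nfd)  # back to the decomposed form

puts nfc.codepoints.inspect  # => [233]
puts nfd.codepoints.inspect  # => [101, 769]
```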
activesupport/bin/generate_tables                             +0   -141
activesupport/lib/active_support/multibyte/unicode.rb         +15  -272
activesupport/lib/active_support/values/unicode_tables.dat    +0   -0
activesupport/test/multibyte_unicode_database_test.rb         +0   -26
activesupport/bin/generate_tables (deleted, mode 100755 → 0)
```ruby
#!/usr/bin/env ruby
# frozen_string_literal: true

begin
  $:.unshift(File.expand_path("../lib", __dir__))
  require "active_support"
rescue IOError
end

require "open-uri"
require "tmpdir"
require "fileutils"

module ActiveSupport
  module Multibyte
    module Unicode
      class UnicodeDatabase
        def load; end
      end

      class DatabaseGenerator
        BASE_URI = "http://www.unicode.org/Public/#{UNICODE_VERSION}/ucd/"
        SOURCES  = {
          codepoints: BASE_URI + "UnicodeData.txt",
          composition_exclusion: BASE_URI + "CompositionExclusions.txt",
          grapheme_break_property: BASE_URI + "auxiliary/GraphemeBreakProperty.txt",
          cp1252: "http://unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WINDOWS/CP1252.TXT"
        }

        def initialize
          @ucd = Unicode::UnicodeDatabase.new
        end

        def parse_codepoints(line)
          codepoint = Codepoint.new
          raise "Could not parse input." unless line =~ /^
            ([0-9A-F]+);        # code
            ([^;]+);            # name
            ([A-Z]+);           # general category
            ([0-9]+);           # canonical combining class
            ([A-Z]+);           # bidi class
            (<([A-Z]*)>)?       # decomposition type
            ((\ ?[0-9A-F]+)*);  # decomposition mapping
            ([0-9]*);           # decimal digit
            ([0-9]*);           # digit
            ([^;]*);            # numeric
            ([YN]*);            # bidi mirrored
            ([^;]*);            # unicode 1.0 name
            ([^;]*);            # iso comment
            ([0-9A-F]*);        # simple uppercase mapping
            ([0-9A-F]*);        # simple lowercase mapping
            ([0-9A-F]*)$/ix     # simple titlecase mapping
          codepoint.code              = $1.hex
          codepoint.combining_class   = Integer($4)
          codepoint.decomp_type       = $7
          codepoint.decomp_mapping    = ($8 == "") ? nil : $8.split.collect(&:hex)
          codepoint.uppercase_mapping = ($16 == "") ? 0 : $16.hex
          codepoint.lowercase_mapping = ($17 == "") ? 0 : $17.hex
          @ucd.codepoints[codepoint.code] = codepoint
        end

        def parse_grapheme_break_property(line)
          if line =~ /^([0-9A-F.]+)\s*;\s*([\w]+)\s*#/
            type = $2.downcase.intern
            @ucd.boundary[type] ||= []
            if $1.include? ".."
              parts = $1.split ".."
              @ucd.boundary[type] << (parts[0].hex..parts[1].hex)
            else
              @ucd.boundary[type] << $1.hex
            end
          end
        end

        def parse_composition_exclusion(line)
          if line =~ /^([0-9A-F]+)/i
            @ucd.composition_exclusion << $1.hex
          end
        end

        def parse_cp1252(line)
          if line =~ /^([0-9A-Fx]+)\s([0-9A-Fx]+)/i
            @ucd.cp1252[$1.hex] = $2.hex
          end
        end

        def create_composition_map
          @ucd.codepoints.each do |_, cp|
            if !cp.nil? && cp.combining_class == 0 && cp.decomp_type.nil? &&
                !cp.decomp_mapping.nil? && cp.decomp_mapping.length == 2 &&
                @ucd.codepoints[cp.decomp_mapping[0]].combining_class == 0 &&
                !@ucd.composition_exclusion.include?(cp.code)
              @ucd.composition_map[cp.decomp_mapping[0]] ||= {}
              @ucd.composition_map[cp.decomp_mapping[0]][cp.decomp_mapping[1]] = cp.code
            end
          end
        end

        def normalize_boundary_map
          @ucd.boundary.each do |k, v|
            if [:lf, :cr].include? k
              @ucd.boundary[k] = v[0]
            end
          end
        end

        def parse
          SOURCES.each do |type, url|
            filename = File.join(Dir.tmpdir, UNICODE_VERSION, "#{url.split('/').last}")
            unless File.exist?(filename)
              $stderr.puts "Downloading #{url.split('/').last}"
              FileUtils.mkdir_p(File.dirname(filename))
              File.open(filename, "wb") do |target|
                open(url) do |source|
                  source.each_line { |line| target.write line }
                end
              end
            end
            File.open(filename) do |file|
              file.each_line { |line| send "parse_#{type}".intern, line }
            end
          end
          create_composition_map
          normalize_boundary_map
        end

        def dump_to(filename)
          File.open(filename, "wb") do |f|
            f.write Marshal.dump([@ucd.codepoints, @ucd.composition_exclusion,
                                  @ucd.composition_map, @ucd.boundary, @ucd.cp1252])
          end
        end
      end
    end
  end
end

if __FILE__ == $0
  filename = ActiveSupport::Multibyte::Unicode::UnicodeDatabase.filename
  generator = ActiveSupport::Multibyte::Unicode::DatabaseGenerator.new
  generator.parse
  print "Writing to: #{filename}"
  generator.dump_to filename
  puts " (#{File.size(filename)} bytes)"
end
```
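The script's `dump_to` and the runtime `UnicodeDatabase#load` round-trip the five parsed tables through `Marshal`; a minimal sketch of that round trip, using stand-in data and a hypothetical `tables_demo.dat` filename:

```ruby
require "tmpdir"

# Stand-in for [codepoints, composition_exclusion, composition_map, boundary, cp1252].
tables = [{ 0x41 => "A" }, [0x0958], {}, { lf: 0x0A }, {}]
path = File.join(Dir.tmpdir, "tables_demo.dat")

File.open(path, "wb") { |f| f.write Marshal.dump(tables) }   # as in dump_to
loaded = File.open(path, "rb") { |f| Marshal.load(f.read) }  # as in UnicodeDatabase#load

puts loaded == tables  # => true
```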
activesupport/lib/active_support/multibyte/unicode.rb
```diff
@@ -11,7 +11,7 @@ module Unicode
       NORMALIZATION_FORMS = [:c, :kc, :d, :kd]
 
       # The Unicode version that is supported by the implementation
-      UNICODE_VERSION = "9.0.0"
+      UNICODE_VERSION = RbConfig::CONFIG["UNICODE_VERSION"]
 
       # The default normalization used for operations that require
       # normalization. It can be set to any of the normalizations
```
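With this change the supported Unicode version is whatever the interpreter itself was built against; on Ruby 2.4+ it can be read directly from `RbConfig` (the printed value varies by Ruby build):

```ruby
require "rbconfig"

# The interpreter reports the Unicode version its own tables were generated
# from, e.g. "9.0.0" on Ruby 2.4.
puts RbConfig::CONFIG["UNICODE_VERSION"]
```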
```diff
@@ -21,96 +21,13 @@ module Unicode
       attr_accessor :default_normalization_form
       @default_normalization_form = :kc
 
-      # Hangul character boundaries and properties
-      HANGUL_SBASE  = 0xAC00
-      HANGUL_LBASE  = 0x1100
-      HANGUL_VBASE  = 0x1161
-      HANGUL_TBASE  = 0x11A7
-      HANGUL_LCOUNT = 19
-      HANGUL_VCOUNT = 21
-      HANGUL_TCOUNT = 28
-      HANGUL_NCOUNT = HANGUL_VCOUNT * HANGUL_TCOUNT
-      HANGUL_SCOUNT = 11172
-      HANGUL_SLAST  = HANGUL_SBASE + HANGUL_SCOUNT
-
-      # Detect whether the codepoint is in a certain character class. Returns
-      # +true+ when it's in the specified character class and +false+ otherwise.
-      # Valid character classes are: <tt>:cr</tt>, <tt>:lf</tt>, <tt>:l</tt>,
-      # <tt>:v</tt>, <tt>:lv</tt>, <tt>:lvt</tt> and <tt>:t</tt>.
-      #
-      # Primarily used by the grapheme cluster support.
-      def in_char_class?(codepoint, classes)
-        classes.detect { |c| database.boundary[c] === codepoint } ? true : false
-      end
-
       # Unpack the string at grapheme boundaries. Returns a list of character
       # lists.
       #
       #   Unicode.unpack_graphemes('क्षि') # => [[2325, 2381], [2359], [2367]]
       #   Unicode.unpack_graphemes('Café') # => [[67], [97], [102], [233]]
       def unpack_graphemes(string)
-        codepoints = string.codepoints.to_a
-        unpacked = []
-        pos = 0
-        marker = 0
-        eoc = codepoints.length
-        while (pos < eoc)
-          pos += 1
-          previous = codepoints[pos - 1]
-          current = codepoints[pos]
-
-          # See http://unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules
-          should_break =
-            if pos == eoc
-              true
-            # GB3. CR X LF
-            elsif previous == database.boundary[:cr] && current == database.boundary[:lf]
-              false
-            # GB4. (Control|CR|LF) ÷
-            elsif previous && in_char_class?(previous, [:control, :cr, :lf])
-              true
-            # GB5. ÷ (Control|CR|LF)
-            elsif in_char_class?(current, [:control, :cr, :lf])
-              true
-            # GB6. L X (L|V|LV|LVT)
-            elsif database.boundary[:l] === previous && in_char_class?(current, [:l, :v, :lv, :lvt])
-              false
-            # GB7. (LV|V) X (V|T)
-            elsif in_char_class?(previous, [:lv, :v]) && in_char_class?(current, [:v, :t])
-              false
-            # GB8. (LVT|T) X (T)
-            elsif in_char_class?(previous, [:lvt, :t]) && database.boundary[:t] === current
-              false
-            # GB9. X (Extend | ZWJ)
-            elsif in_char_class?(current, [:extend, :zwj])
-              false
-            # GB9a. X SpacingMark
-            elsif database.boundary[:spacingmark] === current
-              false
-            # GB9b. Prepend X
-            elsif database.boundary[:prepend] === previous
-              false
-            # GB10. (E_Base | EBG) Extend* X E_Modifier
-            elsif (marker...pos).any? { |i| in_char_class?(codepoints[i], [:e_base, :e_base_gaz]) && codepoints[i + 1...pos].all? { |c| database.boundary[:extend] === c } } && database.boundary[:e_modifier] === current
-              false
-            # GB11. ZWJ X (Glue_After_Zwj | EBG)
-            elsif database.boundary[:zwj] === previous && in_char_class?(current, [:glue_after_zwj, :e_base_gaz])
-              false
-            # GB12. ^ (RI RI)* RI X RI
-            # GB13. [^RI] (RI RI)* RI X RI
-            elsif codepoints[marker..pos].all? { |c| database.boundary[:regional_indicator] === c } && codepoints[marker..pos].count { |c| database.boundary[:regional_indicator] === c }.even?
-              false
-            # GB999. Any ÷ Any
-            else
-              true
-            end
-
-          if should_break
-            unpacked << codepoints[marker..pos - 1]
-            marker = pos
-          end
-        end
-        unpacked
+        string.scan(/\X/).map(&:codepoints)
       end
 
       # Reverse operation of unpack_graphemes.
```
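The hand-coded GB3–GB999 rules above are what the one-line replacement delegates to the regexp engine: `\X` matches one extended grapheme cluster. A small sketch:

```ruby
# A combining mark stays attached to its base; CR LF is one cluster (GB3).
puts "e\u0301z".scan(/\X/).map(&:codepoints).inspect  # => [[101, 769], [122]]
puts "\r\n".scan(/\X/).length                         # => 1
```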
```diff
@@ -120,100 +37,18 @@ def pack_graphemes(unpacked)
         unpacked.flatten.pack("U*")
       end
 
-      # Re-order codepoints so the string becomes canonical.
-      def reorder_characters(codepoints)
-        length = codepoints.length - 1
-        pos = 0
-        while pos < length do
-          cp1, cp2 = database.codepoints[codepoints[pos]], database.codepoints[codepoints[pos + 1]]
-          if (cp1.combining_class > cp2.combining_class) && (cp2.combining_class > 0)
-            codepoints[pos..pos + 1] = cp2.code, cp1.code
-            pos += (pos > 0 ? -1 : 1)
-          else
-            pos += 1
-          end
-        end
-        codepoints
-      end
-
       # Decompose composed characters to the decomposed form.
       def decompose(type, codepoints)
-        codepoints.inject([]) do |decomposed, cp|
-          # if it's a hangul syllable starter character
-          if HANGUL_SBASE <= cp && cp < HANGUL_SLAST
-            sindex = cp - HANGUL_SBASE
-            ncp = [] # new codepoints
-            ncp << HANGUL_LBASE + sindex / HANGUL_NCOUNT
-            ncp << HANGUL_VBASE + (sindex % HANGUL_NCOUNT) / HANGUL_TCOUNT
-            tindex = sindex % HANGUL_TCOUNT
-            ncp << (HANGUL_TBASE + tindex) unless tindex == 0
-            decomposed.concat ncp
-          # if the codepoint is decomposable in with the current decomposition type
-          elsif (ncp = database.codepoints[cp].decomp_mapping) && (!database.codepoints[cp].decomp_type || type == :compatibility)
-            decomposed.concat decompose(type, ncp.dup)
-          else
-            decomposed << cp
-          end
+        if type == :compatibility
+          codepoints.pack("U*").unicode_normalize(:nfkd).codepoints
+        else
+          codepoints.pack("U*").unicode_normalize(:nfd).codepoints
         end
       end
 
       # Compose decomposed characters to the composed form.
       def compose(codepoints)
-        pos = 0
-        eoa = codepoints.length - 1
-        starter_pos = 0
-        starter_char = codepoints[0]
-        previous_combining_class = -1
-        while pos < eoa
-          pos += 1
-          lindex = starter_char - HANGUL_LBASE
-          # -- Hangul
-          if 0 <= lindex && lindex < HANGUL_LCOUNT
-            vindex = codepoints[starter_pos + 1] - HANGUL_VBASE rescue vindex = -1
-            if 0 <= vindex && vindex < HANGUL_VCOUNT
-              tindex = codepoints[starter_pos + 2] - HANGUL_TBASE rescue tindex = -1
-              if 0 <= tindex && tindex < HANGUL_TCOUNT
-                j = starter_pos + 2
-                eoa -= 2
-              else
-                tindex = 0
-                j = starter_pos + 1
-                eoa -= 1
-              end
-              codepoints[starter_pos..j] = (lindex * HANGUL_VCOUNT + vindex) * HANGUL_TCOUNT + tindex + HANGUL_SBASE
-            end
-            starter_pos += 1
-            starter_char = codepoints[starter_pos]
-          # -- Other characters
-          else
-            current_char = codepoints[pos]
-            current = database.codepoints[current_char]
-            if current.combining_class > previous_combining_class
-              if ref = database.composition_map[starter_char]
-                composition = ref[current_char]
-              else
-                composition = nil
-              end
-              unless composition.nil?
-                codepoints[starter_pos] = composition
-                starter_char = composition
-                codepoints.delete_at pos
-                eoa -= 1
-                pos -= 1
-                previous_combining_class = -1
-              else
-                previous_combining_class = current.combining_class
-              end
-            else
-              previous_combining_class = current.combining_class
-            end
-            if current.combining_class == 0
-              starter_pos = pos
-              starter_char = codepoints[pos]
-            end
-          end
-        end
-        codepoints
+        codepoints.pack("U*").unicode_normalize(:nfc).codepoints
       end
 
       # Rubinius' String#scrub, however, doesn't support ASCII-incompatible chars.
```
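`unicode_normalize` also subsumes the Hangul jamo arithmetic that `decompose`/`compose` carried out with the `HANGUL_*` constants; a small sketch with 가 (U+AC00):

```ruby
han  = "\uAC00"                                # 가, a precomposed Hangul syllable
jamo = han.unicode_normalize(:nfd).codepoints  # decompose to L + V jamo
puts jamo.inspect                              # => [4352, 4449]  (0x1100, 0x1161)
puts jamo.pack("U*").unicode_normalize(:nfc) == han  # => true
```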
```diff
@@ -266,129 +101,37 @@ def tidy_bytes(string, force = false)
       def normalize(string, form = nil)
         form ||= @default_normalization_form
         # See http://www.unicode.org/reports/tr15, Table 1
-        codepoints = string.codepoints.to_a
         case form
         when :d
-          reorder_characters(decompose(:canonical, codepoints))
+          string.unicode_normalize(:nfd)
         when :c
-          compose(reorder_characters(decompose(:canonical, codepoints)))
+          string.unicode_normalize(:nfc)
         when :kd
-          reorder_characters(decompose(:compatibility, codepoints))
+          string.unicode_normalize(:nfkd)
         when :kc
-          compose(reorder_characters(decompose(:compatibility, codepoints)))
+          string.unicode_normalize(:nfkc)
         else
           raise ArgumentError, "#{form} is not a valid normalization variant", caller
-        end.pack("U*".freeze)
+        end
       end
 
       def downcase(string)
-        apply_mapping string, :lowercase_mapping
+        string.downcase
       end
 
       def upcase(string)
-        apply_mapping string, :uppercase_mapping
+        string.upcase
       end
 
       def swapcase(string)
-        apply_mapping string, :swapcase_mapping
-      end
-
-      # Holds data about a codepoint in the Unicode database.
-      class Codepoint
-        attr_accessor :code, :combining_class, :decomp_type,
-                      :decomp_mapping, :uppercase_mapping, :lowercase_mapping
-
-        # Initializing Codepoint object with default values
-        def initialize
-          @combining_class = 0
-          @uppercase_mapping = 0
-          @lowercase_mapping = 0
-        end
-
-        def swapcase_mapping
-          uppercase_mapping > 0 ? uppercase_mapping : lowercase_mapping
-        end
-      end
-
-      # Holds static data from the Unicode database.
-      class UnicodeDatabase
-        ATTRIBUTES = :codepoints, :composition_exclusion, :composition_map, :boundary, :cp1252
-
-        attr_writer(*ATTRIBUTES)
-
-        def initialize
-          @codepoints = Hash.new(Codepoint.new)
-          @composition_exclusion = []
-          @composition_map = {}
-          @boundary = {}
-          @cp1252 = {}
-        end
-
-        # Lazy load the Unicode database so it's only loaded when it's actually used
-        ATTRIBUTES.each do |attr_name|
-          class_eval(<<-EOS, __FILE__, __LINE__ + 1)
-            def #{attr_name}     # def codepoints
-              load               #   load
-              @#{attr_name}      #   @codepoints
-            end                  # end
-          EOS
-        end
-
-        # Loads the Unicode database and returns all the internal objects of
-        # UnicodeDatabase.
-        def load
-          begin
-            @codepoints, @composition_exclusion, @composition_map, @boundary, @cp1252 = File.open(self.class.filename, "rb") { |f| Marshal.load f.read }
-          rescue => e
-            raise IOError.new("Couldn't load the Unicode tables for UTF8Handler (#{e.message}), ActiveSupport::Multibyte is unusable")
-          end
-
-          # Redefine the === method so we can write shorter rules for grapheme cluster breaks
-          @boundary.each_key do |k|
-            @boundary[k].instance_eval do
-              def ===(other)
-                detect { |i| i === other } ? true : false
-              end
-            end if @boundary[k].kind_of?(Array)
-          end
-
-          # define attr_reader methods for the instance variables
-          class << self
-            attr_reader(*ATTRIBUTES)
-          end
-        end
-
-        # Returns the directory in which the data files are stored.
-        def self.dirname
-          File.expand_path("../values", __dir__)
-        end
-
-        # Returns the filename for the data file for this version.
-        def self.filename
-          File.expand_path File.join(dirname, "unicode_tables.dat")
-        end
+        string.swapcase
       end
 
       private
 
-        def apply_mapping(string, mapping)
-          database.codepoints
-          string.each_codepoint.map do |codepoint|
-            cp = database.codepoints[codepoint]
-            if cp && (ncp = cp.send(mapping)) && ncp > 0
-              ncp
-            else
-              codepoint
-            end
-          end.pack("U*")
-        end
-
         def recode_windows1252_chars(string)
           string.encode(Encoding::UTF_8, Encoding::Windows_1252, invalid: :replace, undef: :replace)
         end
-
-        def database
-          @database ||= UnicodeDatabase.new
-        end
     end
   end
 end
```
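The rewritten `normalize` and the case methods map straight onto core `String`; a small sketch of the compatibility form and of full Unicode case mapping (the latter needs Ruby 2.4+):

```ruby
puts "\uFB01".unicode_normalize(:nfkc)  # ligature ﬁ under :kc compatibility → "fi"
puts "\u00C5BC".downcase                # => "åbc" via plain String#downcase
```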
activesupport/lib/active_support/values/unicode_tables.dat (deleted, mode 100644 → 0)

File deleted.
activesupport/test/multibyte_unicode_database_test.rb (deleted, mode 100644 → 0)
```ruby
# frozen_string_literal: true

require "abstract_unit"

class MultibyteUnicodeDatabaseTest < ActiveSupport::TestCase
  include ActiveSupport::Multibyte::Unicode

  def setup
    @ucd = UnicodeDatabase.new
  end

  UnicodeDatabase::ATTRIBUTES.each do |attribute|
    define_method "test_lazy_loading_on_attribute_access_of_#{attribute}" do
      assert_called(@ucd, :load) do
        @ucd.send(attribute)
      end
    end
  end

  def test_load
    @ucd.load
    UnicodeDatabase::ATTRIBUTES.each do |attribute|
      assert @ucd.send(attribute).length > 1
    end
  end
end
```