...
 
Commits (17)
https://gitcode.net/wjd2002/ncnn/-/commit/4c861a0d1a4569c0a8d7d14e9e163e71686e0745 Add Building with Intel oneAPI (#4920) 2023-08-06T21:41:12+08:00 mizu-bai shiragawa4519@outlook.com
https://gitcode.net/wjd2002/ncnn/-/commit/0a8cf31a0583026f115e243dcced1fe901cdbbe3 Add POWER8 VSX toolchains (#4853) 2023-08-06T22:16:34+08:00 JeremyRand 244188+JeremyRand@users.noreply.github.com
    * Add POWER8 VSX toolchains: POWER8, though slower than POWER9, is still used in the wild; these toolchains should still be much faster on POWER8 than POWER8 without VSX optimizations.
    * VSX toolchains: set -cpu arg in QEMU CI tests
https://gitcode.net/wjd2002/ncnn/-/commit/60fedae38b2eeab557a846e4bdcb30b697778dae fix pnnx ghost reshape shape expression inputs, fix intmax overflow on fuse/e... 2023-08-07T17:28:15+08:00 nihui nihuini@tencent.com
https://gitcode.net/wjd2002/ncnn/-/commit/285d0793d402763556f7f412dd1f0936a689b587 pnnx fuse expression for scalar-like attribute and unbind chain (#4928) 2023-08-10T14:24:24+08:00 nihui nihuini@tencent.com
https://gitcode.net/wjd2002/ncnn/-/commit/4abadd2ffb75bf209ded4254771be077fc1847b6 binaryop implicit broadcast B with 1 dimension rank for outer axis (#4930) 2023-08-10T21:29:49+08:00 nihui nihuini@tencent.com
https://gitcode.net/wjd2002/ncnn/-/commit/ffe1510c2f90134628a9751f600d59c10a98682d Bump pypa/cibuildwheel from 2.13.1 to 2.15.0 (#4926) 2023-08-11T11:11:21+08:00 dependabot[bot] 49699333+dependabot[bot]@users.noreply.github.com
    Bumps pypa/cibuildwheel (https://github.com/pypa/cibuildwheel) from 2.13.1 to 2.15.0.
    - Release notes: https://github.com/pypa/cibuildwheel/releases
    - Changelog: https://github.com/pypa/cibuildwheel/blob/main/docs/changelog.md
    - Commits: https://github.com/pypa/cibuildwheel/compare/v2.13.1...v2.15.0
    updated-dependencies: - dependency-name: pypa/cibuildwheel dependency-type: direct:production update-type: version-update:semver-minor ...
    Signed-off-by: dependabot[bot] <support@github.com>
    Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
https://gitcode.net/wjd2002/ncnn/-/commit/a24787b32b32acb2d6d365a6bdd8426d92ad74d0 feat(benchmark/benchncnn.cpp): support user defined case (#4782) 2023-08-11T11:17:24+08:00 tpoisonooo khj.application@aliyun.com
https://gitcode.net/wjd2002/ncnn/-/commit/75e10c6e6157b9c632199a22ae4951a292aa725d Support mac platform static library compilation (#4859) 2023-08-11T11:19:18+08:00 佰阅 43716063+Baiyuetribe@users.noreply.github.com
https://gitcode.net/wjd2002/ncnn/-/commit/e80fcbca8f67cf107beebb4dd0333856879dc6fa prefer faster and larger device local only memory on amd integrated graphics, heap budget value follows the same strategy as blob allocator (#4936) 2023-08-12T19:43:30+08:00 nihui nihuini@tencent.com
https://gitcode.net/wjd2002/ncnn/-/commit/070a6d40f27525427dd1c12153019a21f8fe9ac4 support torch.t to ncnn (#4940) 2023-08-14T15:46:31+08:00 WXB 64680548+XiaBing992@users.noreply.github.com
https://gitcode.net/wjd2002/ncnn/-/commit/fed3b43c730d3ef6154beef1f1bb1c8f8fc68de1 Add logxxx to log comp xxx rewriter where xxx = sigmoid or softmax (#4925) 2023-08-15T17:21:39+08:00 lrw04 2428592483@qq.com
    * Add logxxx to log comp xxx rewriter
    * Use pattern matching for LogSigmoid and LogSoftmax
    * Add conversion passes for functional counterparts
    * Update documentation
https://gitcode.net/wjd2002/ncnn/-/commit/93e395dc4b8f24b30d64e0ac08448223df10ce1b pnnx convert torch maximum minimum and torch max min as expression (#4944) 2023-08-15T17:22:44+08:00 nihui nihuini@tencent.com
    * reset device check dtype kind int
    * placeholder for ncnn sign
    * convert torch maximum minimum
    * torch.max as expression
    * torch.min as expression
https://gitcode.net/wjd2002/ncnn/-/commit/00da9251b1d986ca0d20a34efd7a79b897731ff0 update python ci version (#4946) 2023-08-15T23:12:50+08:00 nihui nihuini@tencent.com
https://gitcode.net/wjd2002/ncnn/-/commit/cbd838f670c94a589f60820e1cde0dc0af38bbb3 [docs] Clean comments and prints when find vulkan (#4948) 2023-08-15T23:43:49+08:00 Zhuo Zhang imzhuo@foxmail.com
https://gitcode.net/wjd2002/ncnn/-/commit/39721eeb9400e33f4708a36f6eb8f61e2ad3d53c require c++17 for building with new protobuf (#4947) 2023-08-16T11:48:56+08:00 nihui nihuini@tencent.com
https://gitcode.net/wjd2002/ncnn/-/commit/6b657a39cbee172a17b7ca8d66171197a17fd611 fix _mm512_i32gather_epi32 and other scatter/gather routines have incorrect s... 2023-08-19T22:56:19+08:00 青菜萝卜冬瓜 i@mail.chainsx.cn
https://gitcode.net/wjd2002/ncnn/-/commit/cb674ac5eddb32f0709a60c81f71d2cbc6bc89da fix build with toolchain defined _L _U constants (#4957) 2023-08-21T10:48:45+08:00 nihui nihuini@tencent.com
......@@ -73,6 +73,52 @@ jobs:
export PATH=$GITHUB_WORKSPACE/qemu-install/bin:$PATH
cd build
TESTS_EXECUTABLE_LOADER=qemu-ppc64le TESTS_EXECUTABLE_LOADER_ARGUMENTS="-L;/usr/powerpc64le-linux-gnu" ctest --output-on-failure -j 2
linux-gcc-power8le-vsx:
runs-on: ubuntu-20.04
steps:
- uses: actions/checkout@v3
- name: cache-qemu
id: cache-qemu
uses: actions/cache@v3
with:
path: qemu-install
key: qemu-ppc64le-install-20220502-2
- name: install-qemu-build-deps
if: steps.cache-qemu.outputs.cache-hit != 'true'
run: |
sudo apt-get update
sudo apt-get install autoconf automake autotools-dev ninja-build
- name: checkout-qemu
if: steps.cache-qemu.outputs.cache-hit != 'true'
uses: actions/checkout@v3
with:
repository: qemu/qemu
path: qemu
ref: f5643914a9e8f79c606a76e6a9d7ea82a3fc3e65
- name: qemu
if: steps.cache-qemu.outputs.cache-hit != 'true'
run: |
cd qemu
./configure --prefix=$GITHUB_WORKSPACE/qemu-install --target-list=ppc64le-linux-user --disable-system
make -j2
make install
- name: powerpc64le-gnu-toolchain
run: |
sudo apt-get update
sudo apt-get install g++-powerpc64le-linux-gnu
- name: configure
run: mkdir build && cd build && cmake -DCMAKE_TOOLCHAIN_FILE=../toolchains/power8le-linux-gnu-vsx.toolchain.cmake -DNCNN_BUILD_TOOLS=OFF -DNCNN_BUILD_EXAMPLES=OFF -DNCNN_BUILD_TESTS=ON ..
- name: build
run: cmake --build build -j 2
- name: test
run: |
export PATH=$GITHUB_WORKSPACE/qemu-install/bin:$PATH
cd build
TESTS_EXECUTABLE_LOADER=qemu-ppc64le TESTS_EXECUTABLE_LOADER_ARGUMENTS="-L;/usr/powerpc64le-linux-gnu;-cpu;power8_v2.0" ctest --output-on-failure -j 2
linux-gcc-power9le-vsx:
runs-on: ubuntu-20.04
steps:
......@@ -118,4 +164,4 @@ jobs:
run: |
export PATH=$GITHUB_WORKSPACE/qemu-install/bin:$PATH
cd build
TESTS_EXECUTABLE_LOADER=qemu-ppc64le TESTS_EXECUTABLE_LOADER_ARGUMENTS="-L;/usr/powerpc64le-linux-gnu" ctest --output-on-failure -j 2
TESTS_EXECUTABLE_LOADER=qemu-ppc64le TESTS_EXECUTABLE_LOADER_ARGUMENTS="-L;/usr/powerpc64le-linux-gnu;-cpu;power9_v2.0" ctest --output-on-failure -j 2
......@@ -31,7 +31,7 @@ jobs:
runs-on: ubuntu-20.04
strategy:
matrix:
python-version: [3.6, 3.7, 3.8, 3.9]
python-version: [3.7, 3.9, 3.11]
steps:
- uses: actions/checkout@v3
......
......@@ -33,7 +33,7 @@ jobs:
runs-on: ubuntu-20.04
strategy:
matrix:
python-version: [3.6, 3.8]
python-version: [3.7, 3.9, 3.11]
steps:
- uses: actions/checkout@v3
with:
......
......@@ -38,7 +38,7 @@ jobs:
runs-on: macos-12
strategy:
matrix:
python-version: [3.6, 3.7, 3.8, 3.9]
python-version: [3.7, 3.9, 3.11]
steps:
- uses: actions/checkout@v3
with:
......
......@@ -68,7 +68,7 @@ jobs:
brew uninstall --ignore-dependencies libomp
- name: Build wheels
uses: pypa/cibuildwheel@v2.13.1
uses: pypa/cibuildwheel@v2.15.0
env:
CIBW_ARCHS_MACOS: ${{ matrix.arch }}
CIBW_ARCHS_LINUX: ${{ matrix.arch }}
......@@ -98,7 +98,7 @@ jobs:
fail-fast: false
matrix:
arch: [aarch64, ppc64le, s390x]
build: ['cp36-*', 'cp37-*', 'cp38-*', 'cp39-*', 'cp310-*', 'cp311-*']
build: ['cp36-*', 'cp37-*', 'cp38-*', 'cp39-*', 'cp310-*', 'cp311-*', 'cp312-*']
include:
- arch: aarch64
build: 'pp37-*'
......@@ -106,6 +106,8 @@ jobs:
build: 'pp38-*'
- arch: aarch64
build: 'pp39-*'
- arch: aarch64
build: 'pp310-*'
steps:
- uses: actions/checkout@v3
......@@ -122,7 +124,7 @@ jobs:
platforms: all
- name: Build wheels
uses: pypa/cibuildwheel@v2.13.1
uses: pypa/cibuildwheel@v2.15.0
env:
CIBW_ARCHS_LINUX: ${{ matrix.arch }}
CIBW_BUILD: ${{ matrix.build }}
......
......@@ -31,7 +31,7 @@ jobs:
runs-on: windows-latest
strategy:
matrix:
python-version: [3.6, 3.7, 3.8, 3.9]
python-version: [3.7, 3.9, 3.11]
env:
UseMultiToolTask: true
steps:
......
# See https://github.com/restyled-io/restyled.io/wiki/Configuring-Restyled
enabled: false
pull_requests: true
commit_template: |
[skip ci] Restyled by ${restyler.name}
exclude:
- "src/stb_image*"
statuses:
differences: true
no_differences: true
error: true
restylers:
- clang-format
- astyle
- clang-format
- astyle
......@@ -203,9 +203,9 @@ ncnn is currently used in many Tencent applications, such as QQ, Qzone, WeChat, 天
## HowTo
**[how to build ncnn library](https://github.com/Tencent/ncnn/wiki/how-to-build) on Linux / Windows / macOS / Raspberry Pi3, Pi4 / Android / NVIDIA Jetson / iOS / WebAssembly / AllWinner D1 / Loongson 2K1000**
**[how to build ncnn library](https://github.com/Tencent/ncnn/wiki/how-to-build) on Linux / Windows / macOS / Raspberry Pi3, Pi4 / POWER / Android / NVIDIA Jetson / iOS / WebAssembly / AllWinner D1 / Loongson 2K1000**
- [Build for Linux / NVIDIA Jetson / Raspberry Pi3, Pi4 / POWER9](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-linux)
- [Build for Linux / NVIDIA Jetson / Raspberry Pi3, Pi4 / POWER](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-linux)
- [Build for Windows x64 using VS2017](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-windows-x64-using-visual-studio-community-2017)
- [Build for macOS](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-macos)
- [Build for ARM Cortex-A family with cross-compiling](https://github.com/Tencent/ncnn/wiki/how-to-build#build-for-arm-cortex-a-family-with-cross-compiling)
......
......@@ -4,7 +4,7 @@ Only the network definition files (ncnn param) are required.
The large model binary files (ncnn bin) are not loaded but generated randomly for speed test.
More model networks may be added later.
If no model is specified, the default model list is benchmarked. More model networks may be added later.
---
Build
......@@ -23,7 +23,9 @@ make -j4
Usage
```shell
# copy all param files to the current directory
./benchncnn [loop count] [num threads] [powersave] [gpu device] [cooling down]
./benchncnn [loop count] [num threads] [powersave] [gpu device] [cooling down] [(key=value)...]
param=model.param
shape=[227,227,3],..
```
run benchncnn on android device
```shell
......@@ -34,7 +36,9 @@ adb shell
# executed in android adb shell
cd /data/local/tmp/
./benchncnn [loop count] [num threads] [powersave] [gpu device] [cooling down]
./benchncnn [loop count] [num threads] [powersave] [gpu device] [cooling down] [(key=value)...]
param=model.param
shape=[227,227,3],..
```
Parameter
......@@ -46,7 +50,8 @@ Parameter
|powersave|0=all cores, 1=little cores only, 2=big cores only|0|
|gpu device|-1=cpu-only, 0=gpu0, 1=gpu1 ...|-1|
|cooling down|0=disable, 1=enable|1|
|param|ncnn model.param filepath|-|
|shape|model input shapes in whc format (see the example below)|-|
Tips: Disable android UI server and set CPU and GPU to max frequency
```shell
......
......@@ -25,6 +25,7 @@
#include "datareader.h"
#include "net.h"
#include "gpu.h"
#include <vector>
class DataReaderFromEmpty : public ncnn::DataReader
{
......@@ -53,11 +54,8 @@ static ncnn::VkAllocator* g_blob_vkallocator = 0;
static ncnn::VkAllocator* g_staging_vkallocator = 0;
#endif // NCNN_VULKAN
void benchmark(const char* comment, const ncnn::Mat& _in, const ncnn::Option& opt)
void benchmark(const char* comment, const std::vector<ncnn::Mat>& _in, const ncnn::Option& opt, bool fixed_path = true)
{
ncnn::Mat in = _in;
in.fill(0.01f);
g_blob_pool_allocator.clear();
g_workspace_pool_allocator.clear();
......@@ -86,9 +84,16 @@ void benchmark(const char* comment, const ncnn::Mat& _in, const ncnn::Option& op
#define MODEL_DIR ""
#endif
char parampath[256];
sprintf(parampath, MODEL_DIR "%s.param", comment);
net.load_param(parampath);
if (fixed_path)
{
char parampath[256];
sprintf(parampath, MODEL_DIR "%s.param", comment);
net.load_param(parampath);
}
else
{
net.load_param(comment);
}
DataReaderFromEmpty dr;
net.load_model(dr);
......@@ -102,14 +107,34 @@ void benchmark(const char* comment, const ncnn::Mat& _in, const ncnn::Option& op
ncnn::sleep(10 * 1000);
}
ncnn::Mat out;
if (input_names.size() > _in.size())
{
fprintf(stderr, "input %ld tensors while model has %ld inputs\n", _in.size(), input_names.size());
return;
}
// initialize input
for (size_t j = 0; j < input_names.size(); ++j)
{
ncnn::Mat in = _in[j];
in.fill(0.01f);
}
// warm up
for (int i = 0; i < g_warmup_loop_count; i++)
{
ncnn::Extractor ex = net.create_extractor();
ex.input(input_names[0], in);
ex.extract(output_names[0], out);
for (size_t j = 0; j < input_names.size(); ++j)
{
ncnn::Mat in = _in[j];
ex.input(input_names[j], in);
}
for (size_t j = 0; j < output_names.size(); ++j)
{
ncnn::Mat out;
ex.extract(output_names[j], out);
}
}
double time_min = DBL_MAX;
......@@ -119,11 +144,19 @@ void benchmark(const char* comment, const ncnn::Mat& _in, const ncnn::Option& op
for (int i = 0; i < g_loop_count; i++)
{
double start = ncnn::get_current_time();
{
ncnn::Extractor ex = net.create_extractor();
ex.input(input_names[0], in);
ex.extract(output_names[0], out);
for (size_t j = 0; j < input_names.size(); ++j)
{
ncnn::Mat in = _in[j];
ex.input(input_names[j], in);
}
for (size_t j = 0; j < output_names.size(); ++j)
{
ncnn::Mat out;
ex.extract(output_names[j], out);
}
}
double end = ncnn::get_current_time();
......@@ -140,6 +173,79 @@ void benchmark(const char* comment, const ncnn::Mat& _in, const ncnn::Option& op
fprintf(stderr, "%20s min = %7.2f max = %7.2f avg = %7.2f\n", comment, time_min, time_max, time_avg);
}
void benchmark(const char* comment, const ncnn::Mat& _in, const ncnn::Option& opt, bool fixed_path = true)
{
std::vector<ncnn::Mat> inputs;
inputs.push_back(_in);
return benchmark(comment, inputs, opt, fixed_path);
}
void show_usage()
{
fprintf(stderr, "Usage: benchncnn [loop count] [num threads] [powersave] [gpu device] [cooling down] [(key=value)...]\n");
fprintf(stderr, " param=model.param\n");
fprintf(stderr, " shape=[227,227,3],...\n");
}
static std::vector<ncnn::Mat> parse_shape_list(char* s)
{
std::vector<std::vector<int> > shapes;
std::vector<ncnn::Mat> mats;
char* pch = strtok(s, "[]");
while (pch != NULL)
{
// parse a,b,c
int v;
int nconsumed = 0;
int nscan = sscanf(pch, "%d%n", &v, &nconsumed);
if (nscan == 1)
{
// ok we get shape
pch += nconsumed;
std::vector<int> s;
s.push_back(v);
nscan = sscanf(pch, ",%d%n", &v, &nconsumed);
while (nscan == 1)
{
pch += nconsumed;
s.push_back(v);
nscan = sscanf(pch, ",%d%n", &v, &nconsumed);
}
// shape end
shapes.push_back(s);
}
pch = strtok(NULL, "[]");
}
for (size_t i = 0; i < shapes.size(); ++i)
{
const std::vector<int>& shape = shapes[i];
switch (shape.size())
{
case 3:
mats.push_back(ncnn::Mat(shape[0], shape[1], shape[2]));
break;
case 2:
mats.push_back(ncnn::Mat(shape[0], shape[1]));
break;
case 1:
mats.push_back(ncnn::Mat(shape[0]));
break;
default:
fprintf(stderr, "unsupported input shape size %ld\n", shape.size());
break;
}
}
return mats;
}
int main(int argc, char** argv)
{
int loop_count = 4;
......@@ -147,6 +253,23 @@ int main(int argc, char** argv)
int powersave = 2;
int gpu_device = -1;
int cooling_down = 1;
char* model = 0;
std::vector<ncnn::Mat> inputs;
for (int i = 1; i < argc; i++)
{
if (argv[i][0] == '-' && argv[i][1] == 'h')
{
show_usage();
return -1;
}
if (strcmp(argv[i], "--help") == 0)
{
show_usage();
return -1;
}
}
if (argc >= 2)
{
......@@ -169,6 +292,35 @@ int main(int argc, char** argv)
cooling_down = atoi(argv[5]);
}
for (int i = 6; i < argc; i++)
{
// key=value
char* kv = argv[i];
char* eqs = strchr(kv, '=');
if (eqs == NULL)
{
fprintf(stderr, "unrecognized arg %s\n", kv);
continue;
}
// split k v
eqs[0] = '\0';
const char* key = kv;
char* value = eqs + 1;
if (strcmp(key, "param") == 0)
model = value;
if (strcmp(key, "shape") == 0)
inputs = parse_shape_list(value);
}
if (model && inputs.empty())
{
fprintf(stderr, "input tensor shape empty!\n");
return -1;
}
#ifdef __EMSCRIPTEN__
EM_ASM(
FS.mkdir('/working');
......@@ -231,78 +383,86 @@ int main(int argc, char** argv)
fprintf(stderr, "gpu_device = %d\n", gpu_device);
fprintf(stderr, "cooling_down = %d\n", (int)g_enable_cooling_down);
// run
benchmark("squeezenet", ncnn::Mat(227, 227, 3), opt);
if (model != 0)
{
// run user defined benchmark
benchmark(model, inputs, opt, false);
}
else
{
// run default cases
benchmark("squeezenet", ncnn::Mat(227, 227, 3), opt);
benchmark("squeezenet_int8", ncnn::Mat(227, 227, 3), opt);
benchmark("squeezenet_int8", ncnn::Mat(227, 227, 3), opt);
benchmark("mobilenet", ncnn::Mat(224, 224, 3), opt);
benchmark("mobilenet", ncnn::Mat(224, 224, 3), opt);
benchmark("mobilenet_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("mobilenet_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("mobilenet_v2", ncnn::Mat(224, 224, 3), opt);
benchmark("mobilenet_v2", ncnn::Mat(224, 224, 3), opt);
// benchmark("mobilenet_v2_int8", ncnn::Mat(224, 224, 3), opt);
// benchmark("mobilenet_v2_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("mobilenet_v3", ncnn::Mat(224, 224, 3), opt);
benchmark("mobilenet_v3", ncnn::Mat(224, 224, 3), opt);
benchmark("shufflenet", ncnn::Mat(224, 224, 3), opt);
benchmark("shufflenet", ncnn::Mat(224, 224, 3), opt);
benchmark("shufflenet_v2", ncnn::Mat(224, 224, 3), opt);
benchmark("shufflenet_v2", ncnn::Mat(224, 224, 3), opt);
benchmark("mnasnet", ncnn::Mat(224, 224, 3), opt);
benchmark("mnasnet", ncnn::Mat(224, 224, 3), opt);
benchmark("proxylessnasnet", ncnn::Mat(224, 224, 3), opt);
benchmark("proxylessnasnet", ncnn::Mat(224, 224, 3), opt);
benchmark("efficientnet_b0", ncnn::Mat(224, 224, 3), opt);
benchmark("efficientnet_b0", ncnn::Mat(224, 224, 3), opt);
benchmark("efficientnetv2_b0", ncnn::Mat(224, 224, 3), opt);
benchmark("efficientnetv2_b0", ncnn::Mat(224, 224, 3), opt);
benchmark("regnety_400m", ncnn::Mat(224, 224, 3), opt);
benchmark("regnety_400m", ncnn::Mat(224, 224, 3), opt);
benchmark("blazeface", ncnn::Mat(128, 128, 3), opt);
benchmark("blazeface", ncnn::Mat(128, 128, 3), opt);
benchmark("googlenet", ncnn::Mat(224, 224, 3), opt);
benchmark("googlenet", ncnn::Mat(224, 224, 3), opt);
benchmark("googlenet_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("googlenet_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("resnet18", ncnn::Mat(224, 224, 3), opt);
benchmark("resnet18", ncnn::Mat(224, 224, 3), opt);
benchmark("resnet18_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("resnet18_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("alexnet", ncnn::Mat(227, 227, 3), opt);
benchmark("alexnet", ncnn::Mat(227, 227, 3), opt);
benchmark("vgg16", ncnn::Mat(224, 224, 3), opt);
benchmark("vgg16", ncnn::Mat(224, 224, 3), opt);
benchmark("vgg16_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("vgg16_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("resnet50", ncnn::Mat(224, 224, 3), opt);
benchmark("resnet50", ncnn::Mat(224, 224, 3), opt);
benchmark("resnet50_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("resnet50_int8", ncnn::Mat(224, 224, 3), opt);
benchmark("squeezenet_ssd", ncnn::Mat(300, 300, 3), opt);
benchmark("squeezenet_ssd", ncnn::Mat(300, 300, 3), opt);
benchmark("squeezenet_ssd_int8", ncnn::Mat(300, 300, 3), opt);
benchmark("squeezenet_ssd_int8", ncnn::Mat(300, 300, 3), opt);
benchmark("mobilenet_ssd", ncnn::Mat(300, 300, 3), opt);
benchmark("mobilenet_ssd", ncnn::Mat(300, 300, 3), opt);
benchmark("mobilenet_ssd_int8", ncnn::Mat(300, 300, 3), opt);
benchmark("mobilenet_ssd_int8", ncnn::Mat(300, 300, 3), opt);
benchmark("mobilenet_yolo", ncnn::Mat(416, 416, 3), opt);
benchmark("mobilenet_yolo", ncnn::Mat(416, 416, 3), opt);
benchmark("mobilenetv2_yolov3", ncnn::Mat(352, 352, 3), opt);
benchmark("mobilenetv2_yolov3", ncnn::Mat(352, 352, 3), opt);
benchmark("yolov4-tiny", ncnn::Mat(416, 416, 3), opt);
benchmark("yolov4-tiny", ncnn::Mat(416, 416, 3), opt);
benchmark("nanodet_m", ncnn::Mat(320, 320, 3), opt);
benchmark("nanodet_m", ncnn::Mat(320, 320, 3), opt);
benchmark("yolo-fastest-1.1", ncnn::Mat(320, 320, 3), opt);
benchmark("yolo-fastest-1.1", ncnn::Mat(320, 320, 3), opt);
benchmark("yolo-fastestv2", ncnn::Mat(352, 352, 3), opt);
benchmark("yolo-fastestv2", ncnn::Mat(352, 352, 3), opt);
benchmark("vision_transformer", ncnn::Mat(384, 384, 3), opt);
benchmark("vision_transformer", ncnn::Mat(384, 384, 3), opt);
benchmark("FastestDet", ncnn::Mat(352, 352, 3), opt);
benchmark("FastestDet", ncnn::Mat(352, 352, 3), opt);
}
#if NCNN_VULKAN
delete g_blob_vkallocator;
delete g_staging_vkallocator;
......
......@@ -65,3 +65,15 @@ pnnx will insert reshape operator at the appropriate position to convert it to e
|[2,3,4,5]|[5]|[2,3,4,5]|
|[2,3,4,5]|[4,5]|[2,3,4,5]|
|[2,3,4,5]|[3,4,5]|[2,3,4,5]|
* implicit broadcast B with 1 dimension rank for outer axis
This exists only for compatibility.
When the sizes are the same, e.g. [2,2] and [2], broadcasting B over the inner axis takes priority.
|A|B|C|
|---|---|---|
|[2,3]|[2]|[2,3]|
|[2,3,4]|[2]|[2,3,4]|
|[2,3,4,5]|[2]|[2,3,4,5]|
......@@ -10,7 +10,8 @@ git submodule update --init
- [Build for Linux](#build-for-linux)
- [Nvidia Jetson](#nvidia-jetson)
- [Raspberry Pi](#raspberry-pi)
- [POWER9](#power9)
- [POWER](#power)
- [Intel oneAPI](#intel-oneapi)
- [Verification](#verification)
- [Build for Windows x64 using Visual Studio Community 2017](#build-for-windows-x64-using-visual-studio-community-2017)
- [Build for macOS](#build-for-macos)
......@@ -88,9 +89,9 @@ You can add `-GNinja` to `cmake` above to use Ninja build system (invoke build u
For Raspberry Pi 3 on a 32-bit OS, add `-DCMAKE_TOOLCHAIN_FILE=../toolchains/pi3.toolchain.cmake` to cmake. You can also consider disabling Vulkan support, as the Vulkan drivers for Raspberry Pi are still not mature, but it doesn't hurt to build the support in and simply not use it.
#### POWER9
#### POWER
With Clang 13 or higher:
For POWER9 with Clang 13 or higher:
```shell
cd ncnn
......@@ -102,7 +103,17 @@ make -j$(nproc)
Earlier versions of Clang may fail to build ncnn due to [Bug 49864](https://github.com/llvm/llvm-project/issues/49864). To use GCC instead, use the `power9le-linux-gnu-vsx.toolchain.cmake` toolchain file. Note that according to benchmarks, Clang appears to produce noticeably faster CPU inference than GCC for POWER9 targets.
Note that the POWER9 toolchain files only support little-endian mode.
For POWER8 instead of POWER9, use the `power8le-linux-gnu-vsx.clang.toolchain.cmake` or `power8le-linux-gnu-vsx.toolchain.cmake` toolchain file. POWER8 will be slower than POWER9.
Note that the POWER toolchain files only support little-endian mode.
#### Intel oneAPI
In addition to the prerequisites in this section, Intel oneAPI BaseKit and HPCKit should be installed. They are freely available from https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit.html and https://www.intel.com/content/www/us/en/developer/tools/oneapi/hpc-toolkit.html.
Intel oneAPI offers two kinds of compilers: the classic `icc/icpc` and the LLVM-based `icx/icpx`. To build with these compilers, add `CC=icc CXX=icpc` or `CC=icx CXX=icpx` before the `cmake` command. When compiling with `icc/icpc`, cmake will warn that the `xop`, `avx512`, and `bf16` extensions are not supported by the compiler, while `icx/icpx` works well.
Both compilers have been tested and pass the ncnn benchmark successfully; the results are included in the ncnn benchmark README. Generally, `icx/icpx` is likely to show better performance than `icc/icpc`, and quantized models can benefit from the extensions that `icx/icpx` supports.
#### Verification
......
......@@ -265,18 +265,17 @@ endif()
if(NCNN_VULKAN)
find_package(Vulkan QUIET)
if(NOT Vulkan_FOUND)
message(STATUS "=== CMAKE_SYSTEM_NAME is: ${CMAKE_SYSTEM_NAME}")
if(DEFINED ENV{VULKAN_SDK})
if(CMAKE_SYSTEM_NAME MATCHES "Linux")
list(APPEND CMAKE_MODULE_PATH "$ENV{VULKAN_SDK}/../source/VulkanTools/cmake")
elseif(CMAKE_SYSTEM_NAME MATCHES "Windows")
list(APPEND CMAKE_MODULE_PATH "$ENV{VULKAN_SDK}/Samples/cmake")
elseif(CMAKE_SYSTEM_NAME MATCHES "Darwin")
message(WARNING "Failed to find vulkan since cmake too old\n"
message(WARNING "Failed to find vulkan since cmake is too old\n"
"cmake >= 3.7 required. Consider `brew upgrade cmake`")
endif()
else()
message(FATAL_ERROR "!! CMake didn't find Vulkan. Please set VULKAN_SDK env var, e.g.:\n"
message(FATAL_ERROR "Error: CMake didn't find Vulkan. Please set VULKAN_SDK env var, e.g.:\n"
"Linux: export VULKAN_SDK=~/soft/vulkansdk/1.2.148.0/x86_64\n"
"Windows: set VULKAN_SDK=E:/lib/VulkanSDK/1.2.148.0\n"
"MacOS: export VULKAN_SDK=~/soft/vulkansdk/1.2.148.0/macOS\n"
......@@ -286,7 +285,28 @@ if(NCNN_VULKAN)
endif()
target_link_libraries(ncnn PUBLIC Vulkan::Vulkan)
# Support mac platform static library compilation
if(NOT NCNN_SHARED_LIB AND CMAKE_HOST_SYSTEM_NAME STREQUAL "Darwin" AND NOT CMAKE_SYSTEM_NAME STREQUAL "iOS")
find_library(CoreFoundation NAMES CoreFoundation)
find_library(Foundation NAMES Foundation)
find_library(QuartzCore NAMES QuartzCore)
find_library(CoreGraphics NAMES CoreGraphics)
find_library(Cocoa NAMES Cocoa)
find_library(Metal NAMES Metal)
find_library(IOKit NAMES IOKit)
find_library(IOSurface NAMES IOSurface)
list(APPEND vulkan_dependent_LINK_LIBRARIES
${Metal}
${IOKit}
${IOSurface}
${QuartzCore}
${CoreGraphics}
${Cocoa}
${Foundation}
${CoreFoundation}
)
target_link_libraries(ncnn PRIVATE ${vulkan_dependent_LINK_LIBRARIES})
endif()
target_include_directories(ncnn PRIVATE $<BUILD_INTERFACE:${CMAKE_CURRENT_SOURCE_DIR}/../>)
target_link_libraries(ncnn PRIVATE glslang SPIRV)
endif()
......
......@@ -738,6 +738,16 @@ VkBufferMemory* VkBlobAllocator::fastMalloc(size_t size)
{
// integrated gpu, prefer unified memory
buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);
// on amd integrated gpu, there is a faster and larger device-only heap
uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physical_device_memory_properties();
uint32_t buffer_heap_index = memory_properties.memoryTypes[buffer_memory_type_index].heapIndex;
uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;
if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)
{
buffer_memory_type_index = device_local_memory_type_index;
}
}
else
{
......@@ -990,6 +1000,16 @@ VkImageMemory* VkBlobAllocator::fastMalloc(int w, int h, int c, size_t elemsize,
{
// integrated gpu, prefer unified memory
image_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);
// on amd integrated gpu, there is a faster and larger device-only heap
uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physical_device_memory_properties();
uint32_t buffer_heap_index = memory_properties.memoryTypes[image_memory_type_index].heapIndex;
uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;
if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)
{
image_memory_type_index = device_local_memory_type_index;
}
}
else
{
......@@ -1299,6 +1319,16 @@ VkBufferMemory* VkWeightAllocator::fastMalloc(size_t size)
{
// integrated gpu, prefer unified memory
buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);
// on amd integrated gpu, there is a faster and larger device-only heap
uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physical_device_memory_properties();
uint32_t buffer_heap_index = memory_properties.memoryTypes[buffer_memory_type_index].heapIndex;
uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;
if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)
{
buffer_memory_type_index = device_local_memory_type_index;
}
}
else
{
......@@ -1348,6 +1378,16 @@ VkBufferMemory* VkWeightAllocator::fastMalloc(size_t size)
{
// integrated gpu, prefer unified memory
buffer_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);
// on amd integrated gpu, there is a faster and larger device-only heap
uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physical_device_memory_properties();
uint32_t buffer_heap_index = memory_properties.memoryTypes[buffer_memory_type_index].heapIndex;
uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;
if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)
{
buffer_memory_type_index = device_local_memory_type_index;
}
}
else
{
......@@ -1484,6 +1524,16 @@ VkImageMemory* VkWeightAllocator::fastMalloc(int w, int h, int c, size_t elemsiz
{
// integrated gpu, prefer unified memory
image_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);
// on amd integrated gpu, there is a faster and larger device-only heap
uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements2.memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physical_device_memory_properties();
uint32_t buffer_heap_index = memory_properties.memoryTypes[image_memory_type_index].heapIndex;
uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;
if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)
{
image_memory_type_index = device_local_memory_type_index;
}
}
else
{
......@@ -1578,6 +1628,16 @@ VkImageMemory* VkWeightAllocator::fastMalloc(int w, int h, int c, size_t elemsiz
{
// integrated gpu, prefer unified memory
image_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT, 0);
// on amd integrated gpu, there is a faster and larger device-only heap
uint32_t device_local_memory_type_index = vkdev->find_memory_index(memoryRequirements.memoryTypeBits, VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT, 0, VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT);
const VkPhysicalDeviceMemoryProperties& memory_properties = vkdev->info.physical_device_memory_properties();
uint32_t buffer_heap_index = memory_properties.memoryTypes[image_memory_type_index].heapIndex;
uint32_t device_local_heap_index = memory_properties.memoryTypes[device_local_memory_type_index].heapIndex;
if (device_local_heap_index < buffer_heap_index && memory_properties.memoryHeaps[device_local_heap_index].size > memory_properties.memoryHeaps[buffer_heap_index].size)
{
image_memory_type_index = device_local_memory_type_index;
}
}
else
{
......
......@@ -3153,23 +3153,13 @@ uint32_t VulkanDevice::get_heap_budget() const
{
const VkPhysicalDeviceMemoryProperties& memory_properties = info.physical_device_memory_properties();
// the first device local heap
uint32_t device_local_heap_index = 0;
uint32_t device_local_heap_size = 0;
for (uint32_t i = 0; i < memory_properties.memoryTypeCount; i++)
{
const VkMemoryHeap& memoryHeap = memory_properties.memoryHeaps[i];
if (memoryHeap.flags & VK_MEMORY_HEAP_DEVICE_LOCAL_BIT)
{
device_local_heap_index = i;
device_local_heap_size = memoryHeap.size / 1024 / 1024;
break;
}
}
uint32_t buffer_memory_type_index = d->dummy_allocator->buffer_memory_type_index;
uint32_t buffer_heap_index = memory_properties.memoryTypes[buffer_memory_type_index].heapIndex;
if (!info.support_VK_EXT_memory_budget())
{
// NCNN_LOGE("heap budget from assumption\n");
uint32_t device_local_heap_size = memory_properties.memoryHeaps[buffer_heap_index].size / 1024 / 1024;
// we usually cannot use all heap
// 70% for 4G+
......@@ -3187,7 +3177,7 @@ uint32_t VulkanDevice::get_heap_budget() const
vkGetPhysicalDeviceMemoryProperties2KHR(info.physical_device(), &memoryProperties);
return memoryBudgetProperties.heapBudget[device_local_heap_index] / 1024 / 1024;
return memoryBudgetProperties.heapBudget[buffer_heap_index] / 1024 / 1024;
}
void VulkanDevice::convert_packing(const VkMat& src, VkMat& dst, int dst_elempack, VkCompute& cmd, const Option& _opt) const
......
......@@ -485,13 +485,46 @@ int BinaryOp_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>
{
// expand inner axes
if (outdims == 2)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.h * B.elempack)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 2;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 1)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 3;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 2)
A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 1)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 4;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 4 && A.dims == 2)
A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 3)
......@@ -501,13 +534,46 @@ int BinaryOp_arm::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>
{
// expand inner axes
if (outdims == 2)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.h * A.elempack)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 2;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 1)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 3;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 2)
B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 1)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 4;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 4 && B.dims == 2)
B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 3)
......@@ -986,13 +1052,46 @@ int BinaryOp_arm::forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vecto
{
// expand inner axes
if (outdims == 2)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.h * B.elempack)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 2;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 1)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 3;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 2)
A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 1)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 4;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 4 && A.dims == 2)
A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 3)
......@@ -1002,13 +1101,46 @@ int BinaryOp_arm::forward_bf16s(const std::vector<Mat>& bottom_blobs, std::vecto
{
// expand inner axes
if (outdims == 2)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.h * A.elempack)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 2;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 1)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 3;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 2)
B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 1)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 4;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 4 && B.dims == 2)
B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 3)
......
......@@ -516,13 +516,46 @@ int BinaryOp_arm::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vecto
{
// expand inner axes
if (outdims == 2)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.h * B.elempack)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 2;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 1)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 3;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 2)
A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 1)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 4;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 4 && A.dims == 2)
A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 3)
......@@ -532,13 +565,46 @@ int BinaryOp_arm::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vecto
{
// expand inner axes
if (outdims == 2)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.h * A.elempack)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 2;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 1)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 3;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 2)
B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 1)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 4;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 4 && B.dims == 2)
B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 3)
......
(This diff is collapsed.)
......@@ -57,8 +57,8 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
const __fp16* weight_xc_RUN = weight_xc.row<const __fp16>(q / 4);
const __fp16* weight_hc_RUN = weight_hc.row<const __fp16>(q / 4);
float32x4_t _R = vcvt_f32_f16(vld1_f16(bias_c_RUBNWN));
float32x4_t _U = vcvt_f32_f16(vld1_f16(bias_c_RUBNWN + 4));
float32x4_t _gru_R = vcvt_f32_f16(vld1_f16(bias_c_RUBNWN));
float32x4_t _gru_U = vcvt_f32_f16(vld1_f16(bias_c_RUBNWN + 4));
float32x4_t _sum1 = vdupq_n_f32(0.f);
float32x4_t _sum2 = vdupq_n_f32(0.f);
float32x4_t _sum3 = vdupq_n_f32(0.f);
......@@ -78,8 +78,8 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _weight_xc_U_2 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 20));
float32x4_t _weight_xc_R_3 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 24));
float32x4_t _weight_xc_U_3 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 28));
_R = vfmaq_laneq_f32(_R, _weight_xc_R, _xi, 0);
_U = vfmaq_laneq_f32(_U, _weight_xc_U, _xi, 0);
_gru_R = vfmaq_laneq_f32(_gru_R, _weight_xc_R, _xi, 0);
_gru_U = vfmaq_laneq_f32(_gru_U, _weight_xc_U, _xi, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_R_1, _xi, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_U_1, _xi, 1);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_R_2, _xi, 2);
......@@ -96,8 +96,8 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _xi = vcvt_f32_f16(vdup_n_f16(xi));
float32x4_t _weight_xc_R = vcvt_f32_f16(vld1_f16(weight_xc_RUN));
float32x4_t _weight_xc_U = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 4));
_R = vmlaq_f32(_R, _weight_xc_R, _xi);
_U = vmlaq_f32(_U, _weight_xc_U, _xi);
_gru_R = vmlaq_f32(_gru_R, _weight_xc_R, _xi);
_gru_U = vmlaq_f32(_gru_U, _weight_xc_U, _xi);
weight_xc_RUN += 8;
}
......@@ -114,8 +114,8 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _weight_hc_U_2 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 20));
float32x4_t _weight_hc_R_3 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 24));
float32x4_t _weight_hc_U_3 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 28));
_R = vfmaq_laneq_f32(_R, _weight_hc_R, _h_cont, 0);
_U = vfmaq_laneq_f32(_U, _weight_hc_U, _h_cont, 0);
_gru_R = vfmaq_laneq_f32(_gru_R, _weight_hc_R, _h_cont, 0);
_gru_U = vfmaq_laneq_f32(_gru_U, _weight_hc_U, _h_cont, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_R_1, _h_cont, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_U_1, _h_cont, 1);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_R_2, _h_cont, 2);
......@@ -132,26 +132,26 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _h_cont = vdupq_n_f32(h_cont);
float32x4_t _weight_hc_R = vcvt_f32_f16(vld1_f16(weight_hc_RUN));
float32x4_t _weight_hc_U = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 4));
_R = vmlaq_f32(_R, _weight_hc_R, _h_cont);
_U = vmlaq_f32(_U, _weight_hc_U, _h_cont);
_gru_R = vmlaq_f32(_gru_R, _weight_hc_R, _h_cont);
_gru_U = vmlaq_f32(_gru_U, _weight_hc_U, _h_cont);
weight_hc_RUN += 8;
}
_R = vaddq_f32(_R, _sum1);
_U = vaddq_f32(_U, _sum2);
_gru_R = vaddq_f32(_gru_R, _sum1);
_gru_U = vaddq_f32(_gru_U, _sum2);
_sum3 = vaddq_f32(_sum3, _sum5);
_sum4 = vaddq_f32(_sum4, _sum6);
_R = vaddq_f32(_R, _sum3);
_U = vaddq_f32(_U, _sum4);
_gru_R = vaddq_f32(_gru_R, _sum3);
_gru_U = vaddq_f32(_gru_U, _sum4);
// sigmoid(R)
// sigmoid(U)
_R = sigmoid_ps(_R);
_U = sigmoid_ps(_U);
_gru_R = sigmoid_ps(_gru_R);
_gru_U = sigmoid_ps(_gru_U);
// gate new
float32x4_t _N = vcvt_f32_f16(vld1_f16(bias_c_RUBNWN + 8));
float32x4_t _gru_N = vcvt_f32_f16(vld1_f16(bias_c_RUBNWN + 8));
_sum1 = vdupq_n_f32(0.f);
_sum2 = vdupq_n_f32(0.f);
_sum3 = vdupq_n_f32(0.f);
......@@ -164,7 +164,7 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _weight_hc_N_1 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 4));
float32x4_t _weight_hc_N_2 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 8));
float32x4_t _weight_hc_N_3 = vcvt_f32_f16(vld1_f16(weight_hc_RUN + 12));
_N = vfmaq_laneq_f32(_N, _weight_hc_N, _h_cont, 0);
_gru_N = vfmaq_laneq_f32(_gru_N, _weight_hc_N, _h_cont, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_N_1, _h_cont, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_N_2, _h_cont, 2);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_N_3, _h_cont, 3);
......@@ -177,16 +177,16 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _h_cont = vdupq_n_f32(h_cont);
float32x4_t _weight_hc_N = vcvt_f32_f16(vld1_f16(weight_hc_RUN));
_N = vmlaq_f32(_N, _weight_hc_N, _h_cont);
_gru_N = vmlaq_f32(_gru_N, _weight_hc_N, _h_cont);
weight_hc_RUN += 4;
}
_N = vaddq_f32(_N, _sum1);
_gru_N = vaddq_f32(_gru_N, _sum1);
_sum2 = vaddq_f32(_sum2, _sum3);
_N = vaddq_f32(_N, _sum2);
_gru_N = vaddq_f32(_gru_N, _sum2);
_N = vmlaq_f32(vcvt_f32_f16(vld1_f16(bias_c_RUBNWN + 12)), _R, _N);
_gru_N = vmlaq_f32(vcvt_f32_f16(vld1_f16(bias_c_RUBNWN + 12)), _gru_R, _gru_N);
_sum1 = vdupq_n_f32(0.f);
_sum2 = vdupq_n_f32(0.f);
_sum3 = vdupq_n_f32(0.f);
......@@ -199,7 +199,7 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _weight_xc_N_1 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 4));
float32x4_t _weight_xc_N_2 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 8));
float32x4_t _weight_xc_N_3 = vcvt_f32_f16(vld1_f16(weight_xc_RUN + 12));
_N = vfmaq_laneq_f32(_N, _weight_xc_N, _xi, 0);
_gru_N = vfmaq_laneq_f32(_gru_N, _weight_xc_N, _xi, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_N_1, _xi, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_N_2, _xi, 2);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_N_3, _xi, 3);
......@@ -212,22 +212,22 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _xi = vcvt_f32_f16(vdup_n_f16(xi));
float32x4_t _weight_xc_N = vcvt_f32_f16(vld1_f16(weight_xc_RUN));
_N = vmlaq_f32(_N, _weight_xc_N, _xi);
_gru_N = vmlaq_f32(_gru_N, _weight_xc_N, _xi);
weight_xc_RUN += 4;
}
_N = vaddq_f32(_N, _sum1);
_gru_N = vaddq_f32(_gru_N, _sum1);
_sum2 = vaddq_f32(_sum2, _sum3);
_N = vaddq_f32(_N, _sum2);
_gru_N = vaddq_f32(_gru_N, _sum2);
// tanh(N)
_N = tanh_ps(_N);
_gru_N = tanh_ps(_gru_N);
float* gates_data = gates.row(q / 4);
vst1q_f32(gates_data, _U);
vst1q_f32(gates_data + 4, _N);
vst1q_f32(gates_data, _gru_U);
vst1q_f32(gates_data + 4, _gru_N);
}
#pragma omp parallel for num_threads(opt.num_threads)
for (int q = remain_num_output_start; q < num_output; q++)
......@@ -314,13 +314,13 @@ static int gru_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
const float* gates_data = gates.row(q / 4);
float32x4_t _U = vld1q_f32(gates_data);
float32x4_t _N = vld1q_f32(gates_data + 4);
float32x4_t _gru_U = vld1q_f32(gates_data);
float32x4_t _gru_N = vld1q_f32(gates_data + 4);
float32x4_t _H = vaddq_f32(vmulq_f32(vsubq_f32(vdupq_n_f32(1.f), _U), _N), vmulq_f32(_U, vld1q_f32(hidden_ptr + q)));
float32x4_t _gru_H = vaddq_f32(vmulq_f32(vsubq_f32(vdupq_n_f32(1.f), _gru_U), _gru_N), vmulq_f32(_gru_U, vld1q_f32(hidden_ptr + q)));
vst1q_f32(hidden_ptr + q, _H);
vst1_f16(output_data + q, vcvt_f16_f32(_H));
vst1q_f32(hidden_ptr + q, _gru_H);
vst1_f16(output_data + q, vcvt_f16_f32(_gru_H));
}
#pragma omp parallel for num_threads(opt.num_threads)
for (int q = remain_num_output_start; q < num_output; q++)
......@@ -463,7 +463,7 @@ static int gru_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
hidden_ptr = hidden_state;
// gate new
float16x4_t _N = vld1_f16(bias_c_RUBNWN + 8);
float16x4_t _gru_N = vld1_f16(bias_c_RUBNWN + 8);
float16x4_t _sum4 = vdup_n_f16((__fp16)0.f);
float16x4_t _sum5 = vdup_n_f16((__fp16)0.f);
float16x4_t _sum6 = vdup_n_f16((__fp16)0.f);
......@@ -481,13 +481,13 @@ static int gru_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
"fmla %5.4h, v3.4h, v4.h[3] \n"
: "=r"(hidden_ptr),
"=r"(weight_hc_RUN),
"=w"(_N),
"=w"(_gru_N),
"=w"(_sum4),
"=w"(_sum5),
"=w"(_sum6)
: "0"(hidden_ptr),
"1"(weight_hc_RUN),
"2"(_N),
"2"(_gru_N),
"3"(_sum4),
"4"(_sum5),
"5"(_sum6)
......@@ -499,16 +499,16 @@ static int gru_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
float16x4_t _h_cont = vdup_n_f16((__fp16)h_cont);
float16x4_t _weight_hc_N = vld1_f16(weight_hc_RUN);
_N = vfma_f16(_N, _weight_hc_N, _h_cont);
_gru_N = vfma_f16(_gru_N, _weight_hc_N, _h_cont);
weight_hc_RUN += 4;
}
_N = vadd_f16(_N, _sum4);
_gru_N = vadd_f16(_gru_N, _sum4);
_sum5 = vadd_f16(_sum5, _sum6);
_N = vadd_f16(_N, _sum5);
_gru_N = vadd_f16(_gru_N, _sum5);
_N = vfma_f16(vld1_f16(bias_c_RUBNWN + 12), vcvt_f16_f32(_R32), _N);
_gru_N = vfma_f16(vld1_f16(bias_c_RUBNWN + 12), vcvt_f16_f32(_R32), _gru_N);
_sum4 = vdup_n_f16((__fp16)0.f);
_sum5 = vdup_n_f16((__fp16)0.f);
_sum6 = vdup_n_f16((__fp16)0.f);
......@@ -525,13 +525,13 @@ static int gru_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
"fmla %5.4h, v3.4h, v4.h[3] \n"
: "=r"(x),
"=r"(weight_xc_RUN),
"=w"(_N),
"=w"(_gru_N),
"=w"(_sum4),
"=w"(_sum5),
"=w"(_sum6)
: "0"(x),
"1"(weight_xc_RUN),
"2"(_N),
"2"(_gru_N),
"3"(_sum4),
"4"(_sum5),
"5"(_sum6)
......@@ -543,17 +543,17 @@ static int gru_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
float16x4_t _xi = vdup_n_f16(xi);
float16x4_t _weight_xc_N = vld1_f16(weight_xc_RUN);
_N = vfma_f16(_N, _weight_xc_N, _xi);
_gru_N = vfma_f16(_gru_N, _weight_xc_N, _xi);
weight_xc_RUN += 4;
}
_N = vadd_f16(_N, _sum4);
_gru_N = vadd_f16(_gru_N, _sum4);
_sum5 = vadd_f16(_sum5, _sum6);
_N = vadd_f16(_N, _sum5);
_gru_N = vadd_f16(_gru_N, _sum5);
// tanh(N)
float32x4_t _N32 = tanh_ps(vcvt_f32_f16(_N));
float32x4_t _N32 = tanh_ps(vcvt_f32_f16(_gru_N));
float* gates_data = gates.row(q / 4);
......@@ -645,13 +645,13 @@ static int gru_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
const float* gates_data = gates.row(q / 4);
float32x4_t _U = vld1q_f32(gates_data);
float32x4_t _N = vld1q_f32(gates_data + 4);
float32x4_t _gru_U = vld1q_f32(gates_data);
float32x4_t _gru_N = vld1q_f32(gates_data + 4);
float32x4_t _H = vaddq_f32(vmulq_f32(vsubq_f32(vdupq_n_f32(1.f), _U), _N), vmulq_f32(_U, vld1q_f32(hidden_ptr + q)));
float32x4_t _gru_H = vaddq_f32(vmulq_f32(vsubq_f32(vdupq_n_f32(1.f), _gru_U), _gru_N), vmulq_f32(_gru_U, vld1q_f32(hidden_ptr + q)));
vst1q_f32(hidden_ptr + q, _H);
vst1_f16(output_data + q, vcvt_f16_f32(_H));
vst1q_f32(hidden_ptr + q, _gru_H);
vst1_f16(output_data + q, vcvt_f16_f32(_gru_H));
}
#pragma omp parallel for num_threads(opt.num_threads)
for (int q = remain_num_output_start; q < num_output; q++)
......
......@@ -254,11 +254,11 @@ static void resize_bicubic_image_pack4(const Mat& src, Mat& dst, float* alpha, i
float32x4_t _rows1 = vld1q_f32(rows1p);
float32x4_t _rows2 = vld1q_f32(rows2p);
float32x4_t _rows3 = vld1q_f32(rows3p);
float32x4_t _D = vmulq_lane_f32(_rows0, vget_low_f32(_b0123), 0);
_D = vmlaq_lane_f32(_D, _rows1, vget_low_f32(_b0123), 1);
_D = vmlaq_lane_f32(_D, _rows2, vget_high_f32(_b0123), 0);
_D = vmlaq_lane_f32(_D, _rows3, vget_high_f32(_b0123), 1);
vst1q_f32(Dp, _D);
float32x4_t _Dp = vmulq_lane_f32(_rows0, vget_low_f32(_b0123), 0);
_Dp = vmlaq_lane_f32(_Dp, _rows1, vget_low_f32(_b0123), 1);
_Dp = vmlaq_lane_f32(_Dp, _rows2, vget_high_f32(_b0123), 0);
_Dp = vmlaq_lane_f32(_Dp, _rows3, vget_high_f32(_b0123), 1);
vst1q_f32(Dp, _Dp);
Dp += 4;
rows0p += 4;
......
......@@ -254,11 +254,11 @@ static void resize_bicubic_image_pack4_bf16s(const Mat& src, Mat& dst, float* al
float32x4_t _rows1 = vld1q_f32(rows1p);
float32x4_t _rows2 = vld1q_f32(rows2p);
float32x4_t _rows3 = vld1q_f32(rows3p);
float32x4_t _D = vmulq_lane_f32(_rows0, vget_low_f32(_b0123), 0);
_D = vmlaq_lane_f32(_D, _rows1, vget_low_f32(_b0123), 1);
_D = vmlaq_lane_f32(_D, _rows2, vget_high_f32(_b0123), 0);
_D = vmlaq_lane_f32(_D, _rows3, vget_high_f32(_b0123), 1);
vst1_u16(Dp, float2bfloat(_D));
float32x4_t _Dp = vmulq_lane_f32(_rows0, vget_low_f32(_b0123), 0);
_Dp = vmlaq_lane_f32(_Dp, _rows1, vget_low_f32(_b0123), 1);
_Dp = vmlaq_lane_f32(_Dp, _rows2, vget_high_f32(_b0123), 0);
_Dp = vmlaq_lane_f32(_Dp, _rows3, vget_high_f32(_b0123), 1);
vst1_u16(Dp, float2bfloat(_Dp));
Dp += 4;
rows0p += 4;
......
......@@ -253,11 +253,11 @@ static void resize_bicubic_image_pack4_fp16s(const Mat& src, Mat& dst, float* al
float32x4_t _rows1 = vld1q_f32(rows1p);
float32x4_t _rows2 = vld1q_f32(rows2p);
float32x4_t _rows3 = vld1q_f32(rows3p);
float32x4_t _D = vmulq_laneq_f32(_rows0, _b0123, 0);
_D = vfmaq_laneq_f32(_D, _rows1, _b0123, 1);
_D = vfmaq_laneq_f32(_D, _rows2, _b0123, 2);
_D = vfmaq_laneq_f32(_D, _rows3, _b0123, 3);
vst1_f16(Dp, vcvt_f16_f32(_D));
float32x4_t _Dp = vmulq_laneq_f32(_rows0, _b0123, 0);
_Dp = vfmaq_laneq_f32(_Dp, _rows1, _b0123, 1);
_Dp = vfmaq_laneq_f32(_Dp, _rows2, _b0123, 2);
_Dp = vfmaq_laneq_f32(_Dp, _rows3, _b0123, 3);
vst1_f16(Dp, vcvt_f16_f32(_Dp));
Dp += 4;
rows0p += 4;
......@@ -511,11 +511,11 @@ static void resize_bicubic_image_pack4_fp16sa(const Mat& src, Mat& dst, __fp16*
float16x4_t _rows1 = vld1_f16(rows1p);
float16x4_t _rows2 = vld1_f16(rows2p);
float16x4_t _rows3 = vld1_f16(rows3p);
float16x4_t _D = vmul_lane_f16(_rows0, _b0123, 0);
_D = vfma_lane_f16(_D, _rows1, _b0123, 1);
_D = vfma_lane_f16(_D, _rows2, _b0123, 2);
_D = vfma_lane_f16(_D, _rows3, _b0123, 3);
vst1_f16(Dp, _D);
float16x4_t _Dp = vmul_lane_f16(_rows0, _b0123, 0);
_Dp = vfma_lane_f16(_Dp, _rows1, _b0123, 1);
_Dp = vfma_lane_f16(_Dp, _rows2, _b0123, 2);
_Dp = vfma_lane_f16(_Dp, _rows3, _b0123, 3);
vst1_f16(Dp, _Dp);
Dp += 4;
rows0p += 4;
......
......@@ -253,11 +253,11 @@ static void resize_bicubic_image_pack8_fp16sa(const Mat& src, Mat& dst, __fp16*
float16x8_t _rows1 = vld1q_f16(rows1p);
float16x8_t _rows2 = vld1q_f16(rows2p);
float16x8_t _rows3 = vld1q_f16(rows3p);
float16x8_t _D = vmulq_lane_f16(_rows0, _b0123, 0);
_D = vfmaq_lane_f16(_D, _rows1, _b0123, 1);
_D = vfmaq_lane_f16(_D, _rows2, _b0123, 2);
_D = vfmaq_lane_f16(_D, _rows3, _b0123, 3);
vst1q_f16(Dp, _D);
float16x8_t _Dp = vmulq_lane_f16(_rows0, _b0123, 0);
_Dp = vfmaq_lane_f16(_Dp, _rows1, _b0123, 1);
_Dp = vfmaq_lane_f16(_Dp, _rows2, _b0123, 2);
_Dp = vfmaq_lane_f16(_Dp, _rows3, _b0123, 3);
vst1q_f16(Dp, _Dp);
Dp += 8;
rows0p += 8;
@@ -193,18 +193,18 @@ static void resize_bilinear_image(const Mat& src, Mat& dst, float* alpha, int* x
float32x4_t _rows0 = vld1q_f32(rows0p);
float32x4_t _rows1 = vld1q_f32(rows1p);
float32x4_t _D = vmulq_f32(_rows0, _b0);
_D = vmlaq_f32(_D, _rows1, _b1);
float32x4_t _Dp = vmulq_f32(_rows0, _b0);
_Dp = vmlaq_f32(_Dp, _rows1, _b1);
vst1q_f32(Dp, _D);
vst1q_f32(Dp, _Dp);
float32x4_t _rows0n = vld1q_f32(rows0p + 4);
float32x4_t _rows1n = vld1q_f32(rows1p + 4);
float32x4_t _Dn = vmulq_f32(_rows0n, _b0);
_Dn = vmlaq_f32(_Dn, _rows1n, _b1);
float32x4_t _Dpn = vmulq_f32(_rows0n, _b0);
_Dpn = vmlaq_f32(_Dpn, _rows1n, _b1);
vst1q_f32(Dp + 4, _Dn);
vst1q_f32(Dp + 4, _Dpn);
Dp += 8;
rows0p += 8;
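The bilinear variants do the same renaming around a two-row blend; a scalar sketch of what the vector code computes (helper is illustrative, not from the ncnn sources):
// Dp[i] = b0*rows0[i] + b1*rows1[i]
static void blend_rows_bilinear(const float* rows0, const float* rows1, float b0, float b1, float* Dp, int w)
{
    for (int i = 0; i < w; i++)
        Dp[i] = rows0[i] * b0 + rows1[i] * b1;
}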
@@ -106,18 +106,18 @@ static void resize_bilinear_image_bf16s(const Mat& src, Mat& dst, float* alpha,
float32x4_t _rows0 = vld1q_f32(rows0p);
float32x4_t _rows1 = vld1q_f32(rows1p);
float32x4_t _D = vmulq_f32(_rows0, _b0);
_D = vmlaq_f32(_D, _rows1, _b1);
float32x4_t _Dp = vmulq_f32(_rows0, _b0);
_Dp = vmlaq_f32(_Dp, _rows1, _b1);
vst1_u16(Dp, float2bfloat(_D));
vst1_u16(Dp, float2bfloat(_Dp));
float32x4_t _rows0n = vld1q_f32(rows0p + 4);
float32x4_t _rows1n = vld1q_f32(rows1p + 4);
float32x4_t _Dn = vmulq_f32(_rows0n, _b0);
_Dn = vmlaq_f32(_Dn, _rows1n, _b1);
float32x4_t _Dpn = vmulq_f32(_rows0n, _b0);
_Dpn = vmlaq_f32(_Dpn, _rows1n, _b1);
vst1_u16(Dp + 4, float2bfloat(_Dn));
vst1_u16(Dp + 4, float2bfloat(_Dpn));
Dp += 8;
rows0p += 8;
@@ -138,10 +138,10 @@ static void resize_bilinear_image_fp16s(const Mat& src, Mat& dst, float* alpha,
float32x4_t _rows0 = vld1q_f32(rows0p);
float32x4_t _rows1 = vld1q_f32(rows1p);
float32x4_t _D = vmulq_f32(_rows0, _b0);
_D = vfmaq_f32(_D, _rows1, _b1);
float32x4_t _Dp = vmulq_f32(_rows0, _b0);
_Dp = vfmaq_f32(_Dp, _rows1, _b1);
vst1_f16(Dp, vcvt_f16_f32(_D));
vst1_f16(Dp, vcvt_f16_f32(_Dp));
float32x4_t _rows0n = vld1q_f32(rows0p + 4);
float32x4_t _rows1n = vld1q_f32(rows1p + 4);
@@ -254,10 +254,10 @@ static void resize_bilinear_image_fp16sa(const Mat& src, Mat& dst, __fp16* alpha
float16x8_t _rows0 = vld1q_f16(rows0p);
float16x8_t _rows1 = vld1q_f16(rows1p);
float16x8_t _D = vmulq_f16(_rows0, _b0);
_D = vfmaq_f16(_D, _rows1, _b1);
float16x8_t _Dp = vmulq_f16(_rows0, _b0);
_Dp = vfmaq_f16(_Dp, _rows1, _b1);
vst1q_f16(Dp, _D);
vst1q_f16(Dp, _Dp);
Dp += 8;
rows0p += 8;
@@ -106,9 +106,9 @@ static void resize_bilinear_image_pack4(const Mat& src, Mat& dst, float* alpha,
{
float32x4_t _rows0 = vld1q_f32(rows0p);
float32x4_t _rows1 = vld1q_f32(rows1p);
float32x4_t _D = vmulq_lane_f32(_rows0, _b01, 0);
_D = vmlaq_lane_f32(_D, _rows1, _b01, 1);
vst1q_f32(Dp, _D);
float32x4_t _Dp = vmulq_lane_f32(_rows0, _b01, 0);
_Dp = vmlaq_lane_f32(_Dp, _rows1, _b01, 1);
vst1q_f32(Dp, _Dp);
Dp += 4;
rows0p += 4;
@@ -106,9 +106,9 @@ static void resize_bilinear_image_pack4_bf16s(const Mat& src, Mat& dst, float* a
{
float32x4_t _rows0 = vld1q_f32(rows0p);
float32x4_t _rows1 = vld1q_f32(rows1p);
float32x4_t _D = vmulq_lane_f32(_rows0, _b01, 0);
_D = vmlaq_lane_f32(_D, _rows1, _b01, 1);
vst1_u16(Dp, float2bfloat(_D));
float32x4_t _Dp = vmulq_lane_f32(_rows0, _b01, 0);
_Dp = vmlaq_lane_f32(_Dp, _rows1, _b01, 1);
vst1_u16(Dp, float2bfloat(_Dp));
Dp += 4;
rows0p += 4;
@@ -106,9 +106,9 @@ static void resize_bilinear_image_pack4_fp16s(const Mat& src, Mat& dst, float* a
{
float32x4_t _rows0 = vld1q_f32(rows0p);
float32x4_t _rows1 = vld1q_f32(rows1p);
float32x4_t _D = vmulq_lane_f32(_rows0, _b01, 0);
_D = vmlaq_lane_f32(_D, _rows1, _b01, 1);
vst1_f16(Dp, vcvt_f16_f32(_D));
float32x4_t _Dp = vmulq_lane_f32(_rows0, _b01, 0);
_Dp = vmlaq_lane_f32(_Dp, _rows1, _b01, 1);
vst1_f16(Dp, vcvt_f16_f32(_Dp));
Dp += 4;
rows0p += 4;
@@ -213,9 +213,9 @@ static void resize_bilinear_image_pack4_fp16sa(const Mat& src, Mat& dst, __fp16*
{
float16x4_t _rows0 = vld1_f16(rows0p);
float16x4_t _rows1 = vld1_f16(rows1p);
float16x4_t _D = vmul_lane_f16(_rows0, _b01, 0);
_D = vfma_lane_f16(_D, _rows1, _b01, 1);
vst1_f16(Dp, _D);
float16x4_t _Dp = vmul_lane_f16(_rows0, _b01, 0);
_Dp = vfma_lane_f16(_Dp, _rows1, _b01, 1);
vst1_f16(Dp, _Dp);
Dp += 4;
rows0p += 4;
@@ -106,9 +106,9 @@ static void resize_bilinear_image_pack8_fp16sa(const Mat& src, Mat& dst, __fp16*
{
float16x8_t _rows0 = vld1q_f16(rows0p);
float16x8_t _rows1 = vld1q_f16(rows1p);
float16x8_t _D = vmulq_lane_f16(_rows0, _b01, 0);
_D = vfmaq_lane_f16(_D, _rows1, _b01, 1);
vst1q_f16(Dp, _D);
float16x8_t _Dp = vmulq_lane_f16(_rows0, _b01, 0);
_Dp = vfmaq_lane_f16(_Dp, _rows1, _b01, 1);
vst1q_f16(Dp, _Dp);
Dp += 8;
rows0p += 8;
@@ -323,24 +323,24 @@ static int lstm(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& w
float32x4x4_t _IFOG_4x4 = vld4q_f32(gates_data);
float32x4_t _I = sigmoid_ps(_IFOG_4x4.val[0]);
float32x4_t _F = sigmoid_ps(_IFOG_4x4.val[1]);
float32x4_t _O = sigmoid_ps(_IFOG_4x4.val[2]);
float32x4_t _G = tanh_ps(_IFOG_4x4.val[3]);
float32x4_t _lstm_I = sigmoid_ps(_IFOG_4x4.val[0]);
float32x4_t _lstm_F = sigmoid_ps(_IFOG_4x4.val[1]);
float32x4_t _lstm_O = sigmoid_ps(_IFOG_4x4.val[2]);
float32x4_t _lstm_G = tanh_ps(_IFOG_4x4.val[3]);
float32x4_t _cell2 = vaddq_f32(vmulq_f32(_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_I, _G));
float32x4_t _H = vmulq_f32(_O, tanh_ps(_cell2));
float32x4_t _cell2 = vaddq_f32(vmulq_f32(_lstm_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_lstm_I, _lstm_G));
float32x4_t _lstm_H = vmulq_f32(_lstm_O, tanh_ps(_cell2));
vst1q_f32(cell_ptr + q, _cell2);
if (num_output == hidden_size)
{
vst1q_f32(hidden_ptr + q, _H);
vst1q_f32(output_data + q, _H);
vst1q_f32(hidden_ptr + q, _lstm_H);
vst1q_f32(output_data + q, _lstm_H);
}
else
{
vst1q_f32(tmp_hidden_ptr + q, _H);
vst1q_f32(tmp_hidden_ptr + q, _lstm_H);
}
}
#endif // __ARM_NEON
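The lstm hunks above and below rename _I/_F/_O/_G/_H to _lstm_*; the per-lane math is unchanged. A scalar sketch of what one lane of the NEON block computes (names and helper are illustrative):
#include <math.h>

static inline float sigmoid(float v) { return 1.f / (1.f + expf(-v)); }

// one output lane: gate activations, cell update, hidden output
static void lstm_cell_update(float I, float F, float O, float G, float* cell, float* H)
{
    I = sigmoid(I);
    F = sigmoid(F);
    O = sigmoid(O);
    G = tanhf(G);
    *cell = F * (*cell) + I * G; // _cell2
    *H = O * tanhf(*cell);       // _lstm_H
}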
@@ -778,24 +778,24 @@ static int lstm_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const
float32x4x4_t _IFOG_4x4 = vld4q_f32(gates_data);
float32x4_t _I = sigmoid_ps(_IFOG_4x4.val[0]);
float32x4_t _F = sigmoid_ps(_IFOG_4x4.val[1]);
float32x4_t _O = sigmoid_ps(_IFOG_4x4.val[2]);
float32x4_t _G = tanh_ps(_IFOG_4x4.val[3]);
float32x4_t _lstm_I = sigmoid_ps(_IFOG_4x4.val[0]);
float32x4_t _lstm_F = sigmoid_ps(_IFOG_4x4.val[1]);
float32x4_t _lstm_O = sigmoid_ps(_IFOG_4x4.val[2]);
float32x4_t _lstm_G = tanh_ps(_IFOG_4x4.val[3]);
float32x4_t _cell2 = vaddq_f32(vmulq_f32(_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_I, _G));
float32x4_t _H = vmulq_f32(_O, tanh_ps(_cell2));
float32x4_t _cell2 = vaddq_f32(vmulq_f32(_lstm_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_lstm_I, _lstm_G));
float32x4_t _lstm_H = vmulq_f32(_lstm_O, tanh_ps(_cell2));
vst1q_f32(cell_ptr + q, _cell2);
if (num_output == hidden_size)
{
vst1q_f32(hidden_ptr + q, _H);
vst1_u16(output_data + q, float2bfloat(_H));
vst1q_f32(hidden_ptr + q, _lstm_H);
vst1_u16(output_data + q, float2bfloat(_lstm_H));
}
else
{
vst1q_f32(tmp_hidden_ptr + q, _H);
vst1q_f32(tmp_hidden_ptr + q, _lstm_H);
}
}
#endif // __ARM_NEON
@@ -163,24 +163,24 @@ static int lstm_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const
float32x4x4_t _IFOG_4x4 = vld4q_f32(gates_data);
float32x4_t _I = sigmoid_ps(_IFOG_4x4.val[0]);
float32x4_t _F = sigmoid_ps(_IFOG_4x4.val[1]);
float32x4_t _O = sigmoid_ps(_IFOG_4x4.val[2]);
float32x4_t _G = tanh_ps(_IFOG_4x4.val[3]);
float32x4_t _lstm_I = sigmoid_ps(_IFOG_4x4.val[0]);
float32x4_t _lstm_F = sigmoid_ps(_IFOG_4x4.val[1]);
float32x4_t _lstm_O = sigmoid_ps(_IFOG_4x4.val[2]);
float32x4_t _lstm_G = tanh_ps(_IFOG_4x4.val[3]);
float32x4_t _cell2 = vaddq_f32(vmulq_f32(_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_I, _G));
float32x4_t _H = vmulq_f32(_O, tanh_ps(_cell2));
float32x4_t _cell2 = vaddq_f32(vmulq_f32(_lstm_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_lstm_I, _lstm_G));
float32x4_t _lstm_H = vmulq_f32(_lstm_O, tanh_ps(_cell2));
vst1q_f32(cell_ptr + q, _cell2);
if (num_output == hidden_size)
{
vst1q_f32(hidden_ptr + q, _H);
vst1_f16(output_data + q, vcvt_f16_f32(_H));
vst1q_f32(hidden_ptr + q, _lstm_H);
vst1_f16(output_data + q, vcvt_f16_f32(_lstm_H));
}
else
{
vst1q_f32(tmp_hidden_ptr + q, _H);
vst1q_f32(tmp_hidden_ptr + q, _lstm_H);
}
}
#pragma omp parallel for num_threads(opt.num_threads)
@@ -503,24 +503,24 @@ static int lstm_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
float16x4x4_t _IFOG_4x4 = vld4_f16(gates_data);
float32x4_t _I = sigmoid_ps(vcvt_f32_f16(_IFOG_4x4.val[0]));
float32x4_t _F = sigmoid_ps(vcvt_f32_f16(_IFOG_4x4.val[1]));
float32x4_t _O = sigmoid_ps(vcvt_f32_f16(_IFOG_4x4.val[2]));
float32x4_t _G = tanh_ps(vcvt_f32_f16(_IFOG_4x4.val[3]));
float32x4_t _lstm_I = sigmoid_ps(vcvt_f32_f16(_IFOG_4x4.val[0]));
float32x4_t _lstm_F = sigmoid_ps(vcvt_f32_f16(_IFOG_4x4.val[1]));
float32x4_t _lstm_O = sigmoid_ps(vcvt_f32_f16(_IFOG_4x4.val[2]));
float32x4_t _lstm_G = tanh_ps(vcvt_f32_f16(_IFOG_4x4.val[3]));
float32x4_t _cell2 = vaddq_f32(vmulq_f32(_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_I, _G));
float32x4_t _H = vmulq_f32(_O, tanh_ps(_cell2));
float32x4_t _cell2 = vaddq_f32(vmulq_f32(_lstm_F, vld1q_f32(cell_ptr + q)), vmulq_f32(_lstm_I, _lstm_G));
float32x4_t _lstm_H = vmulq_f32(_lstm_O, tanh_ps(_cell2));
vst1q_f32(cell_ptr + q, _cell2);
if (num_output == hidden_size)
{
vst1q_f32(hidden_ptr + q, _H);
vst1_f16(output_data + q, vcvt_f16_f32(_H));
vst1q_f32(hidden_ptr + q, _lstm_H);
vst1_f16(output_data + q, vcvt_f16_f32(_lstm_H));
}
else
{
vst1q_f32(tmp_hidden_ptr + q, _H);
vst1q_f32(tmp_hidden_ptr + q, _lstm_H);
}
}
#pragma omp parallel for num_threads(opt.num_threads)
@@ -176,7 +176,7 @@ static int rnn(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& we
const float* weight_xc_ptr = weight_xc.row(q / 4);
const float* weight_hc_ptr = weight_hc.row(q / 4);
float32x4_t _H = vld1q_f32((const float*)bias_c + q);
float32x4_t _rnn_H = vld1q_f32((const float*)bias_c + q);
float32x4_t _sum1 = vdupq_n_f32(0.f);
float32x4_t _sum2 = vdupq_n_f32(0.f);
float32x4_t _sum3 = vdupq_n_f32(0.f);
@@ -190,12 +190,12 @@ static int rnn(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& we
float32x4_t _weight_xc_2 = vld1q_f32(weight_xc_ptr + 8);
float32x4_t _weight_xc_3 = vld1q_f32(weight_xc_ptr + 12);
#if __aarch64__
_H = vfmaq_laneq_f32(_H, _weight_xc, _x, 0);
_rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_xc, _x, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_1, _x, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_2, _x, 2);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_3, _x, 3);
#else
_H = vmlaq_lane_f32(_H, _weight_xc, vget_low_f32(_x), 0);
_rnn_H = vmlaq_lane_f32(_rnn_H, _weight_xc, vget_low_f32(_x), 0);
_sum1 = vmlaq_lane_f32(_sum1, _weight_xc_1, vget_low_f32(_x), 1);
_sum2 = vmlaq_lane_f32(_sum2, _weight_xc_2, vget_high_f32(_x), 0);
_sum3 = vmlaq_lane_f32(_sum3, _weight_xc_3, vget_high_f32(_x), 1);
@@ -207,7 +207,7 @@ static int rnn(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& we
{
float32x4_t _x = vdupq_n_f32(x[i]);
float32x4_t _weight_xc = vld1q_f32(weight_xc_ptr);
_H = vmlaq_f32(_H, _weight_xc, _x);
_rnn_H = vmlaq_f32(_rnn_H, _weight_xc, _x);
weight_xc_ptr += 4;
}
@@ -221,12 +221,12 @@ static int rnn(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& we
float32x4_t _weight_hc_2 = vld1q_f32(weight_hc_ptr + 8);
float32x4_t _weight_hc_3 = vld1q_f32(weight_hc_ptr + 12);
#if __aarch64__
_H = vfmaq_laneq_f32(_H, _weight_hc, _hidden_state, 0);
_rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_hc, _hidden_state, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_1, _hidden_state, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_2, _hidden_state, 2);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_3, _hidden_state, 3);
#else
_H = vmlaq_lane_f32(_H, _weight_hc, vget_low_f32(_hidden_state), 0);
_rnn_H = vmlaq_lane_f32(_rnn_H, _weight_hc, vget_low_f32(_hidden_state), 0);
_sum1 = vmlaq_lane_f32(_sum1, _weight_hc_1, vget_low_f32(_hidden_state), 1);
_sum2 = vmlaq_lane_f32(_sum2, _weight_hc_2, vget_high_f32(_hidden_state), 0);
_sum3 = vmlaq_lane_f32(_sum3, _weight_hc_3, vget_high_f32(_hidden_state), 1);
@@ -238,18 +238,18 @@ static int rnn(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& we
{
float32x4_t _hidden_state = vdupq_n_f32(hidden_state[i]);
float32x4_t _weight_hc = vld1q_f32(weight_hc_ptr);
_H = vmlaq_f32(_H, _weight_hc, _hidden_state);
_rnn_H = vmlaq_f32(_rnn_H, _weight_hc, _hidden_state);
weight_hc_ptr += 4;
}
_H = vaddq_f32(_H, _sum1);
_rnn_H = vaddq_f32(_rnn_H, _sum1);
_sum2 = vaddq_f32(_sum2, _sum3);
_H = vaddq_f32(_H, _sum2);
_rnn_H = vaddq_f32(_rnn_H, _sum2);
_H = tanh_ps(_H);
_rnn_H = tanh_ps(_rnn_H);
vst1q_f32((float*)gates + q, _H);
vst1q_f32((float*)gates + q, _rnn_H);
}
#endif // __ARM_NEON
#pragma omp parallel for num_threads(opt.num_threads)
@@ -293,10 +293,10 @@ static int rnn(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& we
{
int q = qq * 4;
float32x4_t _H = vld1q_f32((float*)gates + q);
float32x4_t _rnn_H = vld1q_f32((float*)gates + q);
vst1q_f32(hidden_ptr + q, _H);
vst1q_f32(output_data + q, _H);
vst1q_f32(hidden_ptr + q, _rnn_H);
vst1q_f32(output_data + q, _rnn_H);
}
#endif // __ARM_NEON
#pragma omp parallel for num_threads(opt.num_threads)
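Likewise the rnn hunks rename _H to _rnn_H without changing the recurrence. A scalar view of one output lane computed by the vectorized code (illustrative helper, not part of the patch):
#include <math.h>

// H[q] = tanh(bias[q] + dot(weight_xc row, x) + dot(weight_hc row, hidden_state))
static float rnn_output(const float* weight_xc_row, const float* x, int size,
                        const float* weight_hc_row, const float* hidden_state, int num_output, float bias)
{
    float H = bias;
    for (int i = 0; i < size; i++)
        H += weight_xc_row[i] * x[i];
    for (int i = 0; i < num_output; i++)
        H += weight_hc_row[i] * hidden_state[i];
    return tanhf(H);
}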
@@ -511,7 +511,7 @@ static int rnn_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
const unsigned short* weight_xc_ptr = weight_xc.row<const unsigned short>(q / 4);
const unsigned short* weight_hc_ptr = weight_hc.row<const unsigned short>(q / 4);
float32x4_t _H = bfloat2float(vld1_u16((const unsigned short*)bias_c + q));
float32x4_t _rnn_H = bfloat2float(vld1_u16((const unsigned short*)bias_c + q));
float32x4_t _sum1 = vdupq_n_f32(0.f);
float32x4_t _sum2 = vdupq_n_f32(0.f);
float32x4_t _sum3 = vdupq_n_f32(0.f);
@@ -525,12 +525,12 @@ static int rnn_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _weight_xc_2 = bfloat2float(vld1_u16(weight_xc_ptr + 8));
float32x4_t _weight_xc_3 = bfloat2float(vld1_u16(weight_xc_ptr + 12));
#if __aarch64__
_H = vfmaq_laneq_f32(_H, _weight_xc, _x, 0);
_rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_xc, _x, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_1, _x, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_2, _x, 2);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_3, _x, 3);
#else
_H = vmlaq_lane_f32(_H, _weight_xc, vget_low_f32(_x), 0);
_rnn_H = vmlaq_lane_f32(_rnn_H, _weight_xc, vget_low_f32(_x), 0);
_sum1 = vmlaq_lane_f32(_sum1, _weight_xc_1, vget_low_f32(_x), 1);
_sum2 = vmlaq_lane_f32(_sum2, _weight_xc_2, vget_high_f32(_x), 0);
_sum3 = vmlaq_lane_f32(_sum3, _weight_xc_3, vget_high_f32(_x), 1);
@@ -542,7 +542,7 @@ static int rnn_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
{
float32x4_t _x = bfloat2float(vdup_n_u16(x[i]));
float32x4_t _weight_xc = bfloat2float(vld1_u16(weight_xc_ptr));
_H = vmlaq_f32(_H, _weight_xc, _x);
_rnn_H = vmlaq_f32(_rnn_H, _weight_xc, _x);
weight_xc_ptr += 4;
}
@@ -556,12 +556,12 @@ static int rnn_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _weight_hc_2 = bfloat2float(vld1_u16(weight_hc_ptr + 8));
float32x4_t _weight_hc_3 = bfloat2float(vld1_u16(weight_hc_ptr + 12));
#if __aarch64__
_H = vfmaq_laneq_f32(_H, _weight_hc, _hidden_state, 0);
_rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_hc, _hidden_state, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_1, _hidden_state, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_2, _hidden_state, 2);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_3, _hidden_state, 3);
#else
_H = vmlaq_lane_f32(_H, _weight_hc, vget_low_f32(_hidden_state), 0);
_rnn_H = vmlaq_lane_f32(_rnn_H, _weight_hc, vget_low_f32(_hidden_state), 0);
_sum1 = vmlaq_lane_f32(_sum1, _weight_hc_1, vget_low_f32(_hidden_state), 1);
_sum2 = vmlaq_lane_f32(_sum2, _weight_hc_2, vget_high_f32(_hidden_state), 0);
_sum3 = vmlaq_lane_f32(_sum3, _weight_hc_3, vget_high_f32(_hidden_state), 1);
@@ -573,18 +573,18 @@ static int rnn_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
{
float32x4_t _hidden_state = vdupq_n_f32(hidden_state[i]);
float32x4_t _weight_hc = bfloat2float(vld1_u16(weight_hc_ptr));
_H = vmlaq_f32(_H, _weight_hc, _hidden_state);
_rnn_H = vmlaq_f32(_rnn_H, _weight_hc, _hidden_state);
weight_hc_ptr += 4;
}
_H = vaddq_f32(_H, _sum1);
_rnn_H = vaddq_f32(_rnn_H, _sum1);
_sum2 = vaddq_f32(_sum2, _sum3);
_H = vaddq_f32(_H, _sum2);
_rnn_H = vaddq_f32(_rnn_H, _sum2);
_H = tanh_ps(_H);
_rnn_H = tanh_ps(_rnn_H);
vst1q_f32((float*)gates + q, _H);
vst1q_f32((float*)gates + q, _rnn_H);
}
#endif // __ARM_NEON
#pragma omp parallel for num_threads(opt.num_threads)
@@ -628,10 +628,10 @@ static int rnn_bf16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
{
int q = qq * 4;
float32x4_t _H = vld1q_f32((float*)gates + q);
float32x4_t _rnn_H = vld1q_f32((float*)gates + q);
vst1q_f32(hidden_ptr + q, _H);
vst1_u16(output_data + q, float2bfloat(_H));
vst1q_f32(hidden_ptr + q, _rnn_H);
vst1_u16(output_data + q, float2bfloat(_rnn_H));
}
#endif // __ARM_NEON
#pragma omp parallel for num_threads(opt.num_threads)
@@ -54,7 +54,7 @@ static int rnn_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
const __fp16* weight_xc_ptr = weight_xc.row<const __fp16>(q / 4);
const __fp16* weight_hc_ptr = weight_hc.row<const __fp16>(q / 4);
float32x4_t _H = vcvt_f32_f16(vld1_f16((const __fp16*)bias_c + q));
float32x4_t _rnn_H = vcvt_f32_f16(vld1_f16((const __fp16*)bias_c + q));
float32x4_t _sum1 = vdupq_n_f32(0.f);
float32x4_t _sum2 = vdupq_n_f32(0.f);
float32x4_t _sum3 = vdupq_n_f32(0.f);
@@ -67,7 +67,7 @@ static int rnn_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _weight_xc_1 = vcvt_f32_f16(vld1_f16(weight_xc_ptr + 4));
float32x4_t _weight_xc_2 = vcvt_f32_f16(vld1_f16(weight_xc_ptr + 8));
float32x4_t _weight_xc_3 = vcvt_f32_f16(vld1_f16(weight_xc_ptr + 12));
_H = vfmaq_laneq_f32(_H, _weight_xc, _x, 0);
_rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_xc, _x, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_xc_1, _x, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_xc_2, _x, 2);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_xc_3, _x, 3);
@@ -78,7 +78,7 @@ static int rnn_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
{
float32x4_t _x = vcvt_f32_f16(vdup_n_f16(x[i]));
float32x4_t _weight_xc = vcvt_f32_f16(vld1_f16(weight_xc_ptr));
_H = vfmaq_f32(_H, _weight_xc, _x);
_rnn_H = vfmaq_f32(_rnn_H, _weight_xc, _x);
weight_xc_ptr += 4;
}
@@ -91,7 +91,7 @@ static int rnn_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
float32x4_t _weight_hc_1 = vcvt_f32_f16(vld1_f16(weight_hc_ptr + 4));
float32x4_t _weight_hc_2 = vcvt_f32_f16(vld1_f16(weight_hc_ptr + 8));
float32x4_t _weight_hc_3 = vcvt_f32_f16(vld1_f16(weight_hc_ptr + 12));
_H = vfmaq_laneq_f32(_H, _weight_hc, _hidden_state, 0);
_rnn_H = vfmaq_laneq_f32(_rnn_H, _weight_hc, _hidden_state, 0);
_sum1 = vfmaq_laneq_f32(_sum1, _weight_hc_1, _hidden_state, 1);
_sum2 = vfmaq_laneq_f32(_sum2, _weight_hc_2, _hidden_state, 2);
_sum3 = vfmaq_laneq_f32(_sum3, _weight_hc_3, _hidden_state, 3);
@@ -102,18 +102,18 @@ static int rnn_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
{
float32x4_t _hidden_state = vdupq_n_f32(hidden_state[i]);
float32x4_t _weight_hc = vcvt_f32_f16(vld1_f16(weight_hc_ptr));
_H = vfmaq_f32(_H, _weight_hc, _hidden_state);
_rnn_H = vfmaq_f32(_rnn_H, _weight_hc, _hidden_state);
weight_hc_ptr += 4;
}
_H = vaddq_f32(_H, _sum1);
_rnn_H = vaddq_f32(_rnn_H, _sum1);
_sum2 = vaddq_f32(_sum2, _sum3);
_H = vaddq_f32(_H, _sum2);
_rnn_H = vaddq_f32(_rnn_H, _sum2);
_H = tanh_ps(_H);
_rnn_H = tanh_ps(_rnn_H);
vst1q_f32((float*)gates + q, _H);
vst1q_f32((float*)gates + q, _rnn_H);
}
#pragma omp parallel for num_threads(opt.num_threads)
for (int q = remain_num_output_start; q < num_output; q++)
@@ -149,10 +149,10 @@ static int rnn_fp16s(const Mat& bottom_blob, Mat& top_blob, int reverse, const M
{
int q = qq * 4;
float32x4_t _H = vld1q_f32((float*)gates + q);
float32x4_t _rnn_H = vld1q_f32((float*)gates + q);
vst1q_f32(hidden_ptr + q, _H);
vst1_f16(output_data + q, vcvt_f16_f32(_H));
vst1q_f32(hidden_ptr + q, _rnn_H);
vst1_f16(output_data + q, vcvt_f16_f32(_rnn_H));
}
#pragma omp parallel for num_threads(opt.num_threads)
for (int q = remain_num_output_start; q < num_output; q++)
@@ -196,7 +196,7 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
const __fp16* weight_xc_ptr = weight_xc.row<const __fp16>(q / 8);
const __fp16* weight_hc_ptr = weight_hc.row<const __fp16>(q / 8);
float16x8_t _H = vld1q_f16((const __fp16*)bias_c + q);
float16x8_t _rnn_H = vld1q_f16((const __fp16*)bias_c + q);
float16x8_t _sum1 = vdupq_n_f16(0.f);
float16x8_t _sum2 = vdupq_n_f16(0.f);
float16x8_t _sum3 = vdupq_n_f16(0.f);
@@ -209,7 +209,7 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
float16x8_t _weight_xc_1 = vld1q_f16(weight_xc_ptr + 8);
float16x8_t _weight_xc_2 = vld1q_f16(weight_xc_ptr + 16);
float16x8_t _weight_xc_3 = vld1q_f16(weight_xc_ptr + 24);
_H = vfmaq_lane_f16(_H, _weight_xc, _x, 0);
_rnn_H = vfmaq_lane_f16(_rnn_H, _weight_xc, _x, 0);
_sum1 = vfmaq_lane_f16(_sum1, _weight_xc_1, _x, 1);
_sum2 = vfmaq_lane_f16(_sum2, _weight_xc_2, _x, 2);
_sum3 = vfmaq_lane_f16(_sum3, _weight_xc_3, _x, 3);
@@ -220,7 +220,7 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
{
float16x8_t _x = vdupq_n_f16(x[i]);
float16x8_t _weight_xc = vld1q_f16(weight_xc_ptr);
_H = vfmaq_f16(_H, _weight_xc, _x);
_rnn_H = vfmaq_f16(_rnn_H, _weight_xc, _x);
weight_xc_ptr += 8;
}
@@ -233,7 +233,7 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
float16x8_t _weight_hc_1 = vld1q_f16(weight_hc_ptr + 8);
float16x8_t _weight_hc_2 = vld1q_f16(weight_hc_ptr + 16);
float16x8_t _weight_hc_3 = vld1q_f16(weight_hc_ptr + 24);
_H = vfmaq_lane_f16(_H, _weight_hc, _hidden_state, 0);
_rnn_H = vfmaq_lane_f16(_rnn_H, _weight_hc, _hidden_state, 0);
_sum1 = vfmaq_lane_f16(_sum1, _weight_hc_1, _hidden_state, 1);
_sum2 = vfmaq_lane_f16(_sum2, _weight_hc_2, _hidden_state, 2);
_sum3 = vfmaq_lane_f16(_sum3, _weight_hc_3, _hidden_state, 3);
@@ -244,17 +244,17 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
{
float16x8_t _hidden_state = vdupq_n_f16((__fp16)hidden_state[i]);
float16x8_t _weight_hc = vld1q_f16(weight_hc_ptr);
_H = vfmaq_f16(_H, _weight_hc, _hidden_state);
_rnn_H = vfmaq_f16(_rnn_H, _weight_hc, _hidden_state);
weight_hc_ptr += 8;
}
_H = vaddq_f16(_H, _sum1);
_rnn_H = vaddq_f16(_rnn_H, _sum1);
_sum2 = vaddq_f16(_sum2, _sum3);
_H = vaddq_f16(_H, _sum2);
_rnn_H = vaddq_f16(_rnn_H, _sum2);
float32x4_t _H32low = tanh_ps(vcvt_f32_f16(vget_low_f16(_H)));
float32x4_t _H32high = tanh_ps(vcvt_f32_f16(vget_high_f16(_H)));
float32x4_t _H32low = tanh_ps(vcvt_f32_f16(vget_low_f16(_rnn_H)));
float32x4_t _H32high = tanh_ps(vcvt_f32_f16(vget_high_f16(_rnn_H)));
vst1q_f32((float*)gates + q, _H32low);
vst1q_f32((float*)gates + q + 4, _H32high);
@@ -268,7 +268,7 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
const __fp16* weight_xc_ptr = weight_xc.row<const __fp16>(q / 8 + (q % 8) / 4);
const __fp16* weight_hc_ptr = weight_hc.row<const __fp16>(q / 8 + (q % 8) / 4);
float16x4_t _H = vld1_f16((const __fp16*)bias_c + q);
float16x4_t _rnn_H = vld1_f16((const __fp16*)bias_c + q);
float16x4_t _sum1 = vdup_n_f16(0.f);
float16x4_t _sum2 = vdup_n_f16(0.f);
float16x4_t _sum3 = vdup_n_f16(0.f);
@@ -281,7 +281,7 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
float16x4_t _weight_xc_1 = vld1_f16(weight_xc_ptr + 4);
float16x4_t _weight_xc_2 = vld1_f16(weight_xc_ptr + 8);
float16x4_t _weight_xc_3 = vld1_f16(weight_xc_ptr + 12);
_H = vfma_lane_f16(_H, _weight_xc, _x, 0);
_rnn_H = vfma_lane_f16(_rnn_H, _weight_xc, _x, 0);
_sum1 = vfma_lane_f16(_sum1, _weight_xc_1, _x, 1);
_sum2 = vfma_lane_f16(_sum2, _weight_xc_2, _x, 2);
_sum3 = vfma_lane_f16(_sum3, _weight_xc_3, _x, 3);
@@ -292,7 +292,7 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
{
float16x4_t _x = vdup_n_f16(x[i]);
float16x4_t _weight_xc = vld1_f16(weight_xc_ptr);
_H = vfma_f16(_H, _weight_xc, _x);
_rnn_H = vfma_f16(_rnn_H, _weight_xc, _x);
weight_xc_ptr += 4;
}
@@ -305,7 +305,7 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
float16x4_t _weight_hc_1 = vld1_f16(weight_hc_ptr + 4);
float16x4_t _weight_hc_2 = vld1_f16(weight_hc_ptr + 8);
float16x4_t _weight_hc_3 = vld1_f16(weight_hc_ptr + 12);
_H = vfma_lane_f16(_H, _weight_hc, _hidden_state, 0);
_rnn_H = vfma_lane_f16(_rnn_H, _weight_hc, _hidden_state, 0);
_sum1 = vfma_lane_f16(_sum1, _weight_hc_1, _hidden_state, 1);
_sum2 = vfma_lane_f16(_sum2, _weight_hc_2, _hidden_state, 2);
_sum3 = vfma_lane_f16(_sum3, _weight_hc_3, _hidden_state, 3);
@@ -316,16 +316,16 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
{
float16x4_t _hidden_state = vdup_n_f16((__fp16)hidden_state[i]);
float16x4_t _weight_hc = vld1_f16(weight_hc_ptr);
_H = vfma_f16(_H, _weight_hc, _hidden_state);
_rnn_H = vfma_f16(_rnn_H, _weight_hc, _hidden_state);
weight_hc_ptr += 4;
}
_H = vadd_f16(_H, _sum1);
_rnn_H = vadd_f16(_rnn_H, _sum1);
_sum2 = vadd_f16(_sum2, _sum3);
_H = vadd_f16(_H, _sum2);
_rnn_H = vadd_f16(_rnn_H, _sum2);
float32x4_t _H32 = tanh_ps(vcvt_f32_f16(_H));
float32x4_t _H32 = tanh_ps(vcvt_f32_f16(_rnn_H));
vst1q_f32((float*)gates + q, _H32);
}
@@ -364,10 +364,10 @@ static int rnn_fp16sa(const Mat& bottom_blob, Mat& top_blob, int reverse, const
{
int q = qq * 4;
float32x4_t _H = vld1q_f32((float*)gates + q);
float32x4_t _rnn_H = vld1q_f32((float*)gates + q);
vst1q_f32(hidden_ptr + q, _H);
vst1_f16(output_data + q, vcvt_f16_f32(_H));
vst1q_f32(hidden_ptr + q, _rnn_H);
vst1_f16(output_data + q, vcvt_f16_f32(_rnn_H));
}
#pragma omp parallel for num_threads(opt.num_threads)
for (int q = remain_num_output_start; q < num_output; q++)
@@ -287,13 +287,28 @@ int BinaryOp::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& to
{
// expand inner axes
if (outdims == 2)
A2 = A.reshape(1, A.w);
{
if (A.w == B.h)
A2 = A.reshape(1, A.w);
else // if (A.w == B.w)
A2 = A.reshape(A.w, 1);
}
if (outdims == 3 && A.dims == 1)
A2 = A.reshape(1, 1, A.w);
{
if (A.w == B.c)
A2 = A.reshape(1, 1, A.w);
else // if (A.w == B.w)
A2 = A.reshape(A.w, 1, 1);
}
if (outdims == 3 && A.dims == 2)
A2 = A.reshape(1, A.w, A.h);
if (outdims == 4 && A.dims == 1)
A2 = A.reshape(1, 1, 1, A.w);
{
if (A.w == B.c)
A2 = A.reshape(1, 1, 1, A.w);
else // if (A.w == B.w)
A2 = A.reshape(A.w, 1, 1, 1);
}
if (outdims == 4 && A.dims == 2)
A2 = A.reshape(1, 1, A.w, A.h);
if (outdims == 4 && A.dims == 3)
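The BinaryOp change makes the reshape of a rank-1 operand depend on which axis of the other operand its length matches, instead of always expanding along the inner axes. A condensed sketch of that rule for the 3-D case (struct and helper are illustrative only):
struct Shape3 { int w, h, c; };

// A is 1-D with aw elements, B is 3-D: put A on the channel (outer) axis if it matches B.c,
// otherwise keep it on the inner w axis, mirroring A.reshape(1, 1, A.w) vs A.reshape(A.w, 1, 1).
static Shape3 expand_1d_for_broadcast(int aw, const Shape3& b)
{
    if (aw == b.c)
        return Shape3{1, 1, aw};
    return Shape3{aw, 1, 1};
}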
@@ -303,13 +318,28 @@ int BinaryOp::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>& to
{
// expand inner axes
if (outdims == 2)
B2 = B.reshape(1, B.w);
{
if (B.w == A.h)
B2 = B.reshape(1, B.w);
else // if (B.w == A.w)
B2 = B.reshape(B.w, 1);
}
if (outdims == 3 && B.dims == 1)
B2 = B.reshape(1, 1, B.w);
{
if (B.w == A.c)
B2 = B.reshape(1, 1, B.w);
else // if (B.w == A.w)
B2 = B.reshape(B.w, 1, 1);
}
if (outdims == 3 && B.dims == 2)
B2 = B.reshape(1, B.w, B.h);
if (outdims == 4 && B.dims == 1)
B2 = B.reshape(1, 1, 1, B.w);
{
if (B.w == A.c)
B2 = B.reshape(1, 1, 1, B.w);
else // if (B.w == A.w)
B2 = B.reshape(B.w, 1, 1, 1);
}
if (outdims == 4 && B.dims == 2)
B2 = B.reshape(1, 1, B.w, B.h);
if (outdims == 4 && B.dims == 3)
@@ -500,13 +500,46 @@ int BinaryOp_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vecto
{
// expand inner axes
if (outdims == 2)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.h * B.elempack)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 2;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 1)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 3;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 2)
A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 1)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 4;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 4 && A.dims == 2)
A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 3)
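The per-architecture versions (loongarch here, mips and riscv below) cannot always call reshape because the operand may be packed, so when the rank-1 blob maps to the inner axis they instead fold elempack into w and describe the same data as elempack-1. A sketch of that descriptor rewrite, assuming A2 already refers to A's data as in the surrounding function (field names follow ncnn::Mat; the helper itself is illustrative):
#include "mat.h" // ncnn::Mat

static void view_as_flat_1d(ncnn::Mat& A2, const ncnn::Mat& A, int outdims)
{
    // reinterpret w packed elements of A as w * elempack plain elements; no data is moved
    A2.dims = outdims;
    A2.w = A.w * A.elempack;
    A2.elempack = 1;
    A2.elemsize = A.elemsize / A.elempack;
    A2.cstep = A2.w;
}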
@@ -516,13 +549,46 @@ int BinaryOp_loongarch::forward(const std::vector<Mat>& bottom_blobs, std::vecto
{
// expand inner axes
if (outdims == 2)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.h * A.elempack)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 2;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 1)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 3;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 2)
B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 1)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 4;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 4 && B.dims == 2)
B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 3)
@@ -268,11 +268,11 @@ static void resize_bicubic_image_pack4(const Mat& src, Mat& dst, float* alpha, i
__m128 _rows1 = (__m128)__lsx_vld(rows1p, 0);
__m128 _rows2 = (__m128)__lsx_vld(rows2p, 0);
__m128 _rows3 = (__m128)__lsx_vld(rows3p, 0);
__m128 _D = __lsx_vfmul_s(_rows0, _b0);
_D = __lsx_vfmadd_s(_b1, _rows1, _D);
_D = __lsx_vfmadd_s(_b2, _rows2, _D);
_D = __lsx_vfmadd_s(_b3, _rows3, _D);
__lsx_vst(_D, Dp, 0);
__m128 _Dp = __lsx_vfmul_s(_rows0, _b0);
_Dp = __lsx_vfmadd_s(_b1, _rows1, _Dp);
_Dp = __lsx_vfmadd_s(_b2, _rows2, _Dp);
_Dp = __lsx_vfmadd_s(_b3, _rows3, _Dp);
__lsx_vst(_Dp, Dp, 0);
Dp += 4;
rows0p += 4;
@@ -143,18 +143,18 @@ static void resize_bilinear_image(const Mat& src, Mat& dst, float* alpha, int* x
__m128 _rows0 = (__m128)__lsx_vld(rows0p, 0);
__m128 _rows1 = (__m128)__lsx_vld(rows1p, 0);
__m128 _D = __lsx_vfmul_s(_rows0, _b0);
_D = __lsx_vfmadd_s(_b1, _rows1, _D);
__m128 _Dp = __lsx_vfmul_s(_rows0, _b0);
_Dp = __lsx_vfmadd_s(_b1, _rows1, _Dp);
__lsx_vst(_D, Dp, 0);
__lsx_vst(_Dp, Dp, 0);
__m128 _rows0n = (__m128)__lsx_vld(rows0p + 4, 0);
__m128 _rows1n = (__m128)__lsx_vld(rows1p + 4, 0);
__m128 _Dn = __lsx_vfmul_s(_rows0n, _b0);
_Dn = __lsx_vfmadd_s(_b1, _rows1n, _Dn);
__m128 _Dpn = __lsx_vfmul_s(_rows0n, _b0);
_Dpn = __lsx_vfmadd_s(_b1, _rows1n, _Dpn);
__lsx_vst(_Dn, Dp + 4, 0);
__lsx_vst(_Dpn, Dp + 4, 0);
Dp += 8;
rows0p += 8;
@@ -109,9 +109,9 @@ static void resize_bilinear_image_pack4(const Mat& src, Mat& dst, float* alpha,
{
__m128 _rows0 = (__m128)__lsx_vld(rows0p, 0);
__m128 _rows1 = (__m128)__lsx_vld(rows1p, 0);
__m128 _D = __lsx_vfmul_s(_rows0, _b0);
_D = __lsx_vfmadd_s(_b1, _rows1, _D);
__lsx_vst(_D, Dp, 0);
__m128 _Dp = __lsx_vfmul_s(_rows0, _b0);
_Dp = __lsx_vfmadd_s(_b1, _rows1, _Dp);
__lsx_vst(_Dp, Dp, 0);
Dp += 4;
rows0p += 4;
@@ -500,13 +500,46 @@ int BinaryOp_mips::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat
{
// expand inner axes
if (outdims == 2)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.h * B.elempack)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 2;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 1)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 3;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 2)
A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 1)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 4;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 4 && A.dims == 2)
A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 3)
@@ -516,13 +549,46 @@ int BinaryOp_mips::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat
{
// expand inner axes
if (outdims == 2)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.h * A.elempack)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 2;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 1)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 3;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 2)
B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 1)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 4;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 4 && B.dims == 2)
B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 3)
@@ -268,11 +268,11 @@ static void resize_bicubic_image_pack4(const Mat& src, Mat& dst, float* alpha, i
v4f32 _rows1 = (v4f32)__msa_ld_w(rows1p, 0);
v4f32 _rows2 = (v4f32)__msa_ld_w(rows2p, 0);
v4f32 _rows3 = (v4f32)__msa_ld_w(rows3p, 0);
v4f32 _D = __msa_fmul_w(_rows0, _b0);
_D = __msa_fmadd_w(_D, _rows1, _b1);
_D = __msa_fmadd_w(_D, _rows2, _b2);
_D = __msa_fmadd_w(_D, _rows3, _b3);
__msa_st_w((v4i32)_D, Dp, 0);
v4f32 _Dp = __msa_fmul_w(_rows0, _b0);
_Dp = __msa_fmadd_w(_Dp, _rows1, _b1);
_Dp = __msa_fmadd_w(_Dp, _rows2, _b2);
_Dp = __msa_fmadd_w(_Dp, _rows3, _b3);
__msa_st_w((v4i32)_Dp, Dp, 0);
Dp += 4;
rows0p += 4;
@@ -143,18 +143,18 @@ static void resize_bilinear_image(const Mat& src, Mat& dst, float* alpha, int* x
v4f32 _rows0 = (v4f32)__msa_ld_w(rows0p, 0);
v4f32 _rows1 = (v4f32)__msa_ld_w(rows1p, 0);
v4f32 _D = __msa_fmul_w(_rows0, _b0);
_D = __msa_fmadd_w(_D, _rows1, _b1);
v4f32 _Dp = __msa_fmul_w(_rows0, _b0);
_Dp = __msa_fmadd_w(_Dp, _rows1, _b1);
__msa_st_w((v4i32)_D, Dp, 0);
__msa_st_w((v4i32)_Dp, Dp, 0);
v4f32 _rows0n = (v4f32)__msa_ld_w(rows0p + 4, 0);
v4f32 _rows1n = (v4f32)__msa_ld_w(rows1p + 4, 0);
v4f32 _Dn = __msa_fmul_w(_rows0n, _b0);
_Dn = __msa_fmadd_w(_Dn, _rows1n, _b1);
v4f32 _Dpn = __msa_fmul_w(_rows0n, _b0);
_Dpn = __msa_fmadd_w(_Dpn, _rows1n, _b1);
__msa_st_w((v4i32)_Dn, Dp + 4, 0);
__msa_st_w((v4i32)_Dpn, Dp + 4, 0);
Dp += 8;
rows0p += 8;
@@ -109,9 +109,9 @@ static void resize_bilinear_image_pack4(const Mat& src, Mat& dst, float* alpha,
{
v4f32 _rows0 = (v4f32)__msa_ld_w(rows0p, 0);
v4f32 _rows1 = (v4f32)__msa_ld_w(rows1p, 0);
v4f32 _D = __msa_fmul_w(_rows0, _b0);
_D = __msa_fmadd_w(_D, _rows1, _b1);
__msa_st_w((v4i32)_D, Dp, 0);
v4f32 _Dp = __msa_fmul_w(_rows0, _b0);
_Dp = __msa_fmadd_w(_Dp, _rows1, _b1);
__msa_st_w((v4i32)_Dp, Dp, 0);
Dp += 4;
rows0p += 4;
@@ -490,13 +490,46 @@ int BinaryOp_riscv::forward(const std::vector<Mat>& bottom_blobs, std::vector<Ma
{
// expand inner axes
if (outdims == 2)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.h * B.elempack)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 2;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 1)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 3;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 2)
A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 1)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 4;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 4 && A.dims == 2)
A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 3)
@@ -506,13 +539,46 @@ int BinaryOp_riscv::forward(const std::vector<Mat>& bottom_blobs, std::vector<Ma
{
// expand inner axes
if (outdims == 2)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.h * A.elempack)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 2;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 1)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 3;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 2)
B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 1)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 4;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 4 && B.dims == 2)
B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 3)
@@ -996,13 +1062,46 @@ int BinaryOp_riscv::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vec
{
// expand inner axes
if (outdims == 2)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.h * B.elempack)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 2;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 1)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 3;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 2)
A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 1)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 4;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 4 && A.dims == 2)
A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 3)
@@ -1012,13 +1111,46 @@ int BinaryOp_riscv::forward_fp16s(const std::vector<Mat>& bottom_blobs, std::vec
{
// expand inner axes
if (outdims == 2)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.h * A.elempack)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 2;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 1)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 3;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 2)
B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 1)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 4;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 4 && B.dims == 2)
B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 3)
@@ -226,9 +226,9 @@ static void resize_bicubic_image_packn(const Mat& src, Mat& dst, float* alpha, i
vfloat32m1_t _rows2 = vle32_v_f32m1(rows2p, vl);
vfloat32m1_t _rows3 = vle32_v_f32m1(rows3p, vl);
vfloat32m1_t _D = vfmacc_vf_f32m1(vfmacc_vf_f32m1(vfmacc_vf_f32m1(vfmul_vf_f32m1(_rows0, b0, vl), b1, _rows1, vl), b2, _rows2, vl), b3, _rows3, vl);
vfloat32m1_t _Dp = vfmacc_vf_f32m1(vfmacc_vf_f32m1(vfmacc_vf_f32m1(vfmul_vf_f32m1(_rows0, b0, vl), b1, _rows1, vl), b2, _rows2, vl), b3, _rows3, vl);
vse32_v_f32m1(Dp, _D, vl);
vse32_v_f32m1(Dp, _Dp, vl);
Dp += packn;
rows0p += packn;
@@ -226,9 +226,9 @@ static void resize_bicubic_image_packn_fp16s(const Mat& src, Mat& dst, float* al
vfloat32m2_t _rows2 = vle32_v_f32m2(rows2p, vl);
vfloat32m2_t _rows3 = vle32_v_f32m2(rows3p, vl);
vfloat32m2_t _D = vfmacc_vf_f32m2(vfmacc_vf_f32m2(vfmacc_vf_f32m2(vfmul_vf_f32m2(_rows0, b0, vl), b1, _rows1, vl), b2, _rows2, vl), b3, _rows3, vl);
vfloat32m2_t _Dp = vfmacc_vf_f32m2(vfmacc_vf_f32m2(vfmacc_vf_f32m2(vfmul_vf_f32m2(_rows0, b0, vl), b1, _rows1, vl), b2, _rows2, vl), b3, _rows3, vl);
vse16_v_f16m1(Dp, vfncvt_f_f_w_f16m1(_D, vl), vl);
vse16_v_f16m1(Dp, vfncvt_f_f_w_f16m1(_Dp, vl), vl);
Dp += packn;
rows0p += packn;
@@ -455,9 +455,9 @@ static void resize_bicubic_image_packn_fp16sa(const Mat& src, Mat& dst, __fp16*
vfloat16m1_t _rows2 = vle16_v_f16m1(rows2p, vl);
vfloat16m1_t _rows3 = vle16_v_f16m1(rows3p, vl);
vfloat16m1_t _D = vfmacc_vf_f16m1(vfmacc_vf_f16m1(vfmacc_vf_f16m1(vfmul_vf_f16m1(_rows0, b0, vl), b1, _rows1, vl), b2, _rows2, vl), b3, _rows3, vl);
vfloat16m1_t _Dp = vfmacc_vf_f16m1(vfmacc_vf_f16m1(vfmacc_vf_f16m1(vfmul_vf_f16m1(_rows0, b0, vl), b1, _rows1, vl), b2, _rows2, vl), b3, _rows3, vl);
vse16_v_f16m1(Dp, _D, vl);
vse16_v_f16m1(Dp, _Dp, vl);
Dp += packn;
rows0p += packn;
@@ -200,9 +200,9 @@ static void resize_bilinear_image(const Mat& src, Mat& dst, float* alpha, int* x
vfloat32m8_t _rows0 = vle32_v_f32m8(rows0p, vl);
vfloat32m8_t _rows1 = vle32_v_f32m8(rows1p, vl);
vfloat32m8_t _D = vfmacc_vf_f32m8(vfmul_vf_f32m8(_rows0, b0, vl), b1, _rows1, vl);
vfloat32m8_t _Dp = vfmacc_vf_f32m8(vfmul_vf_f32m8(_rows0, b0, vl), b1, _rows1, vl);
vse32_v_f32m8(Dp, _D, vl);
vse32_v_f32m8(Dp, _Dp, vl);
Dp += vl;
rows0p += vl;
@@ -136,9 +136,9 @@ static void resize_bilinear_image_fp16s(const Mat& src, Mat& dst, float* alpha,
vfloat32m8_t _rows0 = vle32_v_f32m8(rows0p, vl);
vfloat32m8_t _rows1 = vle32_v_f32m8(rows1p, vl);
vfloat32m8_t _D = vfmacc_vf_f32m8(vfmul_vf_f32m8(_rows0, b0, vl), b1, _rows1, vl);
vfloat32m8_t _Dp = vfmacc_vf_f32m8(vfmul_vf_f32m8(_rows0, b0, vl), b1, _rows1, vl);
vse16_v_f16m4(Dp, vfncvt_f_f_w_f16m4(_D, vl), vl);
vse16_v_f16m4(Dp, vfncvt_f_f_w_f16m4(_Dp, vl), vl);
Dp += vl;
rows0p += vl;
@@ -237,9 +237,9 @@ static void resize_bilinear_image_fp16sa(const Mat& src, Mat& dst, __fp16* alpha
vfloat16m8_t _rows0 = vle16_v_f16m8(rows0p, vl);
vfloat16m8_t _rows1 = vle16_v_f16m8(rows1p, vl);
vfloat16m8_t _D = vfmacc_vf_f16m8(vfmul_vf_f16m8(_rows0, b0, vl), b1, _rows1, vl);
vfloat16m8_t _Dp = vfmacc_vf_f16m8(vfmul_vf_f16m8(_rows0, b0, vl), b1, _rows1, vl);
vse16_v_f16m8(Dp, _D, vl);
vse16_v_f16m8(Dp, _Dp, vl);
Dp += vl;
rows0p += vl;
@@ -106,9 +106,9 @@ static void resize_bilinear_image_packn(const Mat& src, Mat& dst, float* alpha,
vfloat32m1_t _rows0 = vle32_v_f32m1(rows0p, vl);
vfloat32m1_t _rows1 = vle32_v_f32m1(rows1p, vl);
vfloat32m1_t _D = vfmacc_vf_f32m1(vfmul_vf_f32m1(_rows0, b0, vl), b1, _rows1, vl);
vfloat32m1_t _Dp = vfmacc_vf_f32m1(vfmul_vf_f32m1(_rows0, b0, vl), b1, _rows1, vl);
vse32_v_f32m1(Dp, _D, vl);
vse32_v_f32m1(Dp, _Dp, vl);
Dp += packn;
rows0p += packn;
@@ -106,9 +106,9 @@ static void resize_bilinear_image_packn_fp16s(const Mat& src, Mat& dst, float* a
vfloat32m2_t _rows0 = vle32_v_f32m2(rows0p, vl);
vfloat32m2_t _rows1 = vle32_v_f32m2(rows1p, vl);
vfloat32m2_t _D = vfmacc_vf_f32m2(vfmul_vf_f32m2(_rows0, b0, vl), b1, _rows1, vl);
vfloat32m2_t _Dp = vfmacc_vf_f32m2(vfmul_vf_f32m2(_rows0, b0, vl), b1, _rows1, vl);
vse16_v_f16m1(Dp, vfncvt_f_f_w_f16m1(_D, vl), vl);
vse16_v_f16m1(Dp, vfncvt_f_f_w_f16m1(_Dp, vl), vl);
Dp += packn;
rows0p += packn;
@@ -213,9 +213,9 @@ static void resize_bilinear_image_packn_fp16sa(const Mat& src, Mat& dst, __fp16*
vfloat16m1_t _rows0 = vle16_v_f16m1(rows0p, vl);
vfloat16m1_t _rows1 = vle16_v_f16m1(rows1p, vl);
vfloat16m1_t _D = vfmacc_vf_f16m1(vfmul_vf_f16m1(_rows0, b0, vl), b1, _rows1, vl);
vfloat16m1_t _Dp = vfmacc_vf_f16m1(vfmul_vf_f16m1(_rows0, b0, vl), b1, _rows1, vl);
vse16_v_f16m1(Dp, _D, vl);
vse16_v_f16m1(Dp, _Dp, vl);
Dp += packn;
rows0p += packn;
@@ -201,14 +201,22 @@ int BinaryOp_vulkan::create_pipeline(const Option& opt)
if (A_shape.dims > 0 && B_shape.dims > 0)
{
const bool a_rank_is_lower = A_shape.dims < B_shape.dims;
const bool a_rank_is_equal = A_shape.dims == B_shape.dims;
const bool a_pack_is_lower = A_elempack < B_elempack;
const bool a_pack_is_equal = A_elempack == B_elempack;
const bool a_size_is_lower = A_shape.w * A_shape.h * A_shape.d * A_shape.c < B_shape.w * B_shape.h * B_shape.d * B_shape.c;
if (a_rank_is_lower || a_pack_is_lower || (a_pack_is_equal && a_size_is_lower))
if (a_rank_is_lower || (a_rank_is_equal && a_pack_is_lower) || (a_pack_is_equal && a_size_is_lower))
{
// swap AB
std::swap(A_shape_packed, B_shape_packed);
}
if (B_shape_packed.dims == 1 && ((A_shape_packed.dims == 2 && B_shape_packed.w * B_shape_packed.elempack != A_shape_packed.h * A_shape_packed.elempack) || ((A_shape_packed.dims == 3 || A_shape_packed.dims == 4) && B_shape_packed.w * B_shape_packed.elempack != A_shape_packed.c * A_shape_packed.elempack)))
{
B_shape_packed.dims = out_shape.dims;
B_shape_packed.w = B_shape_packed.w * B_shape_packed.elempack;
B_shape_packed.elempack = 1;
}
}
const int op_type_r = get_reverse_op_type(op_type);
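In create_pipeline the swap criterion now lets a lower elempack force the swap only when the ranks are equal, so a genuinely lower-rank A always stays the broadcast operand. The same predicate, lifted into a stand-alone helper for clarity (the helper itself is illustrative):
// returns true when A is the lower-rank, lower-packed or otherwise smaller operand
static bool a_is_broadcast_operand(int a_dims, int b_dims, int a_elempack, int b_elempack, long a_elems, long b_elems)
{
    const bool a_rank_is_lower = a_dims < b_dims;
    const bool a_rank_is_equal = a_dims == b_dims;
    const bool a_pack_is_lower = a_elempack < b_elempack;
    const bool a_pack_is_equal = a_elempack == b_elempack;
    const bool a_size_is_lower = a_elems < b_elems;
    return a_rank_is_lower || (a_rank_is_equal && a_pack_is_lower) || (a_pack_is_equal && a_size_is_lower);
}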
@@ -298,7 +306,7 @@ int BinaryOp_vulkan::create_pipeline(const Option& opt)
}
// pack4
if (out_shape.dims == 0 || (A_elempack == 4 && B_elempack == 4 && out_elempack == 4))
if (out_shape.dims == 0 || (A_shape_packed.elempack == 4 && B_shape_packed.elempack == 4 && out_elempack == 4))
{
pipeline_binaryop_broadcast_pack4[0] = new Pipeline(vkdev);
pipeline_binaryop_broadcast_pack4[0]->set_optimal_local_size_xyz(local_size_xyz);
@@ -314,7 +322,7 @@ int BinaryOp_vulkan::create_pipeline(const Option& opt)
}
// pack1to4
if (out_shape.dims == 0 || ((A_elempack == 1 || B_elempack == 1) && out_elempack == 4))
if (out_shape.dims == 0 || ((A_shape_packed.elempack == 1 || B_shape_packed.elempack == 1) && out_elempack == 4))
{
pipeline_binaryop_broadcast_pack1to4[0] = new Pipeline(vkdev);
pipeline_binaryop_broadcast_pack1to4[0]->set_optimal_local_size_xyz(local_size_xyz);
@@ -330,7 +338,7 @@ int BinaryOp_vulkan::create_pipeline(const Option& opt)
}
// pack8
if ((opt.use_shader_pack8 && out_shape.dims == 0) || (A_elempack == 8 && B_elempack == 8 && out_elempack == 8))
if ((opt.use_shader_pack8 && out_shape.dims == 0) || (A_shape_packed.elempack == 8 && B_shape_packed.elempack == 8 && out_elempack == 8))
{
pipeline_binaryop_broadcast_pack8[0] = new Pipeline(vkdev);
pipeline_binaryop_broadcast_pack8[0]->set_optimal_local_size_xyz(local_size_xyz);
@@ -346,7 +354,7 @@ int BinaryOp_vulkan::create_pipeline(const Option& opt)
}
// pack1to8
if ((opt.use_shader_pack8 && out_shape.dims == 0) || ((A_elempack == 1 || B_elempack == 1) && out_elempack == 8))
if ((opt.use_shader_pack8 && out_shape.dims == 0) || ((A_shape_packed.elempack == 1 || B_shape_packed.elempack == 1) && out_elempack == 8))
{
pipeline_binaryop_broadcast_pack1to8[0] = new Pipeline(vkdev);
pipeline_binaryop_broadcast_pack1to8[0]->set_optimal_local_size_xyz(local_size_xyz);
@@ -409,10 +417,10 @@ int BinaryOp_vulkan::forward(const std::vector<VkMat>& bottom_blobs, std::vector
const VkMat& A = bottom_blobs[0];
const VkMat& B = bottom_blobs[1];
const int outdims = std::max(A.dims, B.dims);
const int out_elempack = std::max(A.elempack, B.elempack);
const bool a_rank_is_lower = A.dims < B.dims;
const bool b_rank_is_lower = B.dims < A.dims;
const bool a_rank_is_equal = A.dims == B.dims;
VkMat& top_blob = top_blobs[0];
if (a_rank_is_lower)
@@ -429,6 +437,7 @@ int BinaryOp_vulkan::forward(const std::vector<VkMat>& bottom_blobs, std::vector
const int outh = std::max(A.h, B.h);
const int outd = std::max(A.d, B.d);
const int outc = std::max(A.c, B.c);
const int out_elempack = std::max(A.elempack, B.elempack);
const size_t out_elemsize = std::max(A.elemsize, B.elemsize);
if (outdims == 1)
@@ -476,8 +485,8 @@ int BinaryOp_vulkan::forward(const std::vector<VkMat>& bottom_blobs, std::vector
constants[13].i = top_blob.c;
constants[14].i = top_blob.cstep;
const Pipeline* pipeline = out_elempack == 8 ? pipeline_binaryop_pack8
: out_elempack == 4 ? pipeline_binaryop_pack4
const Pipeline* pipeline = top_blob.elempack == 8 ? pipeline_binaryop_pack8
: top_blob.elempack == 4 ? pipeline_binaryop_pack4
: pipeline_binaryop;
cmd.record_pipeline(pipeline, bindings, constants, top_blob);
@@ -488,11 +497,22 @@ int BinaryOp_vulkan::forward(const std::vector<VkMat>& bottom_blobs, std::vector
const bool a_pack_is_lower = A.elempack < B.elempack;
const bool a_pack_is_equal = A.elempack == B.elempack;
const bool a_size_is_lower = A.w * A.h * A.d * A.c * A.elempack < B.w * B.h * B.d * B.c * B.elempack;
if (a_rank_is_lower || a_pack_is_lower || (a_pack_is_equal && a_size_is_lower))
if (a_rank_is_lower || (a_rank_is_equal && a_pack_is_lower) || (a_pack_is_equal && a_size_is_lower))
{
VkMat A2;
if (A.dims == 1 && ((B.dims == 2 && A.w * A.elempack != B.h * B.elempack) || ((B.dims == 3 || B.dims == 4) && A.w * A.elempack != B.c * B.elempack)))
{
vkdev->convert_packing(A, A2, 1, cmd, opt);
A2.dims = top_blob.dims;
}
else
{
A2 = A;
}
std::vector<VkMat> bindings(3);
bindings[0] = B;
bindings[1] = A;
bindings[1] = A2;
bindings[2] = top_blob;
std::vector<vk_constant_type> constants(18);
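On the GPU path a packed rank-1 operand cannot be reinterpreted in place, so the new code repacks it to elempack 1 (and stamps the output rank on it) whenever its length does not line up with the other operand's outer axis. The test, written out as a stand-alone helper (illustrative; the real code inlines the same condition in both the A-smaller and B-smaller branches):
static bool needs_outer_axis_repack(int a_dims, int a_w, int a_elempack, int b_dims, int b_h, int b_c, int b_elempack)
{
    if (a_dims != 1)
        return false;
    const int a_len = a_w * a_elempack;
    if (b_dims == 2)
        return a_len != b_h * b_elempack; // must match B's rows to broadcast per row
    if (b_dims == 3 || b_dims == 4)
        return a_len != b_c * b_elempack; // must match B's channels to broadcast per channel
    return false;
}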
@@ -502,12 +522,12 @@ int BinaryOp_vulkan::forward(const std::vector<VkMat>& bottom_blobs, std::vector
constants[3].i = B.d;
constants[4].i = B.c;
constants[5].i = B.cstep;
constants[6].i = A.dims;
constants[7].i = A.w;
constants[8].i = A.h;
constants[9].i = A.d;
constants[10].i = A.c;
constants[11].i = A.cstep;
constants[6].i = A2.dims;
constants[7].i = A2.w;
constants[8].i = A2.h;
constants[9].i = A2.d;
constants[10].i = A2.c;
constants[11].i = A2.cstep;
constants[12].i = top_blob.dims;
constants[13].i = top_blob.w;
constants[14].i = top_blob.h;
@@ -518,23 +538,23 @@ int BinaryOp_vulkan::forward(const std::vector<VkMat>& bottom_blobs, std::vector
const int ri = get_reverse_op_type(op_type) == op_type ? 0 : 1;
const Pipeline* pipeline = 0;
if (A.elempack == 1 && out_elempack == 1)
if (A2.elempack == 1 && top_blob.elempack == 1)
{
pipeline = pipeline_binaryop_broadcast[ri];
}
if (A.elempack == 4 && out_elempack == 4)
if (A2.elempack == 4 && top_blob.elempack == 4)
{
pipeline = pipeline_binaryop_broadcast_pack4[ri];
}
if (A.elempack == 1 && out_elempack == 4)
if (A2.elempack == 1 && top_blob.elempack == 4)
{
pipeline = pipeline_binaryop_broadcast_pack1to4[ri];
}
if (A.elempack == 8 && out_elempack == 8)
if (A2.elempack == 8 && top_blob.elempack == 8)
{
pipeline = pipeline_binaryop_broadcast_pack8[ri];
}
if (A.elempack == 1 && out_elempack == 8)
if (A2.elempack == 1 && top_blob.elempack == 8)
{
pipeline = pipeline_binaryop_broadcast_pack1to8[ri];
}
@@ -543,9 +563,20 @@ int BinaryOp_vulkan::forward(const std::vector<VkMat>& bottom_blobs, std::vector
}
else
{
VkMat B2;
if (B.dims == 1 && ((A.dims == 2 && B.w * B.elempack != A.h * A.elempack) || ((A.dims == 3 || A.dims == 4) && B.w * B.elempack != A.c * A.elempack)))
{
vkdev->convert_packing(B, B2, 1, cmd, opt);
B2.dims = top_blob.dims;
}
else
{
B2 = B;
}
std::vector<VkMat> bindings(3);
bindings[0] = A;
bindings[1] = B;
bindings[1] = B2;
bindings[2] = top_blob;
std::vector<vk_constant_type> constants(18);
@@ -555,12 +586,12 @@ int BinaryOp_vulkan::forward(const std::vector<VkMat>& bottom_blobs, std::vector
constants[3].i = A.d;
constants[4].i = A.c;
constants[5].i = A.cstep;
constants[6].i = B.dims;
constants[7].i = B.w;
constants[8].i = B.h;
constants[9].i = B.d;
constants[10].i = B.c;
constants[11].i = B.cstep;
constants[6].i = B2.dims;
constants[7].i = B2.w;
constants[8].i = B2.h;
constants[9].i = B2.d;
constants[10].i = B2.c;
constants[11].i = B2.cstep;
constants[12].i = top_blob.dims;
constants[13].i = top_blob.w;
constants[14].i = top_blob.h;
@@ -569,23 +600,23 @@ int BinaryOp_vulkan::forward(const std::vector<VkMat>& bottom_blobs, std::vector
constants[17].i = top_blob.cstep;
const Pipeline* pipeline = 0;
if (B.elempack == 1 && out_elempack == 1)
if (B2.elempack == 1 && top_blob.elempack == 1)
{
pipeline = pipeline_binaryop_broadcast[0];
}
if (B.elempack == 4 && out_elempack == 4)
if (B2.elempack == 4 && top_blob.elempack == 4)
{
pipeline = pipeline_binaryop_broadcast_pack4[0];
}
if (B.elempack == 1 && out_elempack == 4)
if (B2.elempack == 1 && top_blob.elempack == 4)
{
pipeline = pipeline_binaryop_broadcast_pack1to4[0];
}
if (B.elempack == 8 && out_elempack == 8)
if (B2.elempack == 8 && top_blob.elempack == 8)
{
pipeline = pipeline_binaryop_broadcast_pack8[0];
}
if (B.elempack == 1 && out_elempack == 8)
if (B2.elempack == 1 && top_blob.elempack == 8)
{
pipeline = pipeline_binaryop_broadcast_pack1to8[0];
}
@@ -626,10 +657,10 @@ int BinaryOp_vulkan::forward(const std::vector<VkImageMat>& bottom_blobs, std::v
const VkImageMat& A = bottom_blobs[0];
const VkImageMat& B = bottom_blobs[1];
const int outdims = std::max(A.dims, B.dims);
const int out_elempack = std::max(A.elempack, B.elempack);
const bool a_rank_is_lower = A.dims < B.dims;
const bool b_rank_is_lower = B.dims < A.dims;
const bool a_rank_is_equal = A.dims == B.dims;
VkImageMat& top_blob = top_blobs[0];
if (a_rank_is_lower)
@@ -646,6 +677,7 @@ int BinaryOp_vulkan::forward(const std::vector<VkImageMat>& bottom_blobs, std::v
const int outh = std::max(A.h, B.h);
const int outd = std::max(A.d, B.d);
const int outc = std::max(A.c, B.c);
const int out_elempack = std::max(A.elempack, B.elempack);
const size_t out_elemsize = std::max(A.elemsize, B.elemsize);
if (outdims == 1)
@@ -693,8 +725,8 @@ int BinaryOp_vulkan::forward(const std::vector<VkImageMat>& bottom_blobs, std::v
constants[13].i = top_blob.c;
constants[14].i = 0; //top_blob.cstep;
const Pipeline* pipeline = out_elempack == 8 ? pipeline_binaryop_pack8
: out_elempack == 4 ? pipeline_binaryop_pack4
const Pipeline* pipeline = top_blob.elempack == 8 ? pipeline_binaryop_pack8
: top_blob.elempack == 4 ? pipeline_binaryop_pack4
: pipeline_binaryop;
cmd.record_pipeline(pipeline, bindings, constants, top_blob);
@@ -705,11 +737,22 @@ int BinaryOp_vulkan::forward(const std::vector<VkImageMat>& bottom_blobs, std::v
const bool a_pack_is_lower = A.elempack < B.elempack;
const bool a_pack_is_equal = A.elempack == B.elempack;
const bool a_size_is_lower = A.w * A.h * A.d * A.c * A.elempack < B.w * B.h * B.d * B.c * B.elempack;
if (a_rank_is_lower || a_pack_is_lower || (a_pack_is_equal && a_size_is_lower))
if (a_rank_is_lower || (a_rank_is_equal && a_pack_is_lower) || (a_pack_is_equal && a_size_is_lower))
{
VkImageMat A2;
if (A.dims == 1 && ((B.dims == 2 && A.w * A.elempack != B.h * B.elempack) || ((B.dims == 3 || B.dims == 4) && A.w * A.elempack != B.c * B.elempack)))
{
vkdev->convert_packing(A, A2, 1, cmd, opt);
A2.dims = top_blob.dims;
}
else
{
A2 = A;
}
std::vector<VkImageMat> bindings(3);
bindings[0] = B;
bindings[1] = A;
bindings[1] = A2;
bindings[2] = top_blob;
std::vector<vk_constant_type> constants(18);
......@@ -719,12 +762,12 @@ int BinaryOp_vulkan::forward(const std::vector<VkImageMat>& bottom_blobs, std::v
constants[3].i = B.d;
constants[4].i = B.c;
constants[5].i = 0; //B.cstep;
constants[6].i = A.dims;
constants[7].i = A.w;
constants[8].i = A.h;
constants[9].i = A.d;
constants[10].i = A.c;
constants[11].i = 0; //A.cstep;
constants[6].i = A2.dims;
constants[7].i = A2.w;
constants[8].i = A2.h;
constants[9].i = A2.d;
constants[10].i = A2.c;
constants[11].i = 0; //A2.cstep;
constants[12].i = top_blob.dims;
constants[13].i = top_blob.w;
constants[14].i = top_blob.h;
......@@ -735,23 +778,23 @@ int BinaryOp_vulkan::forward(const std::vector<VkImageMat>& bottom_blobs, std::v
const int ri = get_reverse_op_type(op_type) == op_type ? 0 : 1;
const Pipeline* pipeline = 0;
if (A.elempack == 1 && out_elempack == 1)
if (A2.elempack == 1 && top_blob.elempack == 1)
{
pipeline = pipeline_binaryop_broadcast[ri];
}
if (A.elempack == 4 && out_elempack == 4)
if (A2.elempack == 4 && top_blob.elempack == 4)
{
pipeline = pipeline_binaryop_broadcast_pack4[ri];
}
if (A.elempack == 1 && out_elempack == 4)
if (A2.elempack == 1 && top_blob.elempack == 4)
{
pipeline = pipeline_binaryop_broadcast_pack1to4[ri];
}
if (A.elempack == 8 && out_elempack == 8)
if (A2.elempack == 8 && top_blob.elempack == 8)
{
pipeline = pipeline_binaryop_broadcast_pack8[ri];
}
if (A.elempack == 1 && out_elempack == 8)
if (A2.elempack == 1 && top_blob.elempack == 8)
{
pipeline = pipeline_binaryop_broadcast_pack1to8[ri];
}
......@@ -760,9 +803,20 @@ int BinaryOp_vulkan::forward(const std::vector<VkImageMat>& bottom_blobs, std::v
}
else
{
VkImageMat B2;
if (B.dims == 1 && ((A.dims == 2 && B.w * B.elempack != A.h * A.elempack) || ((A.dims == 3 || A.dims == 4) && B.w * B.elempack != A.c * A.elempack)))
{
vkdev->convert_packing(B, B2, 1, cmd, opt);
B2.dims = top_blob.dims;
}
else
{
B2 = B;
}
std::vector<VkImageMat> bindings(3);
bindings[0] = A;
bindings[1] = B;
bindings[1] = B2;
bindings[2] = top_blob;
std::vector<vk_constant_type> constants(18);
......@@ -772,12 +826,12 @@ int BinaryOp_vulkan::forward(const std::vector<VkImageMat>& bottom_blobs, std::v
constants[3].i = A.d;
constants[4].i = A.c;
constants[5].i = 0; //A.cstep;
constants[6].i = B.dims;
constants[7].i = B.w;
constants[8].i = B.h;
constants[9].i = B.d;
constants[10].i = B.c;
constants[11].i = 0; //B.cstep;
constants[6].i = B2.dims;
constants[7].i = B2.w;
constants[8].i = B2.h;
constants[9].i = B2.d;
constants[10].i = B2.c;
constants[11].i = 0; //B2.cstep;
constants[12].i = top_blob.dims;
constants[13].i = top_blob.w;
constants[14].i = top_blob.h;
......@@ -786,23 +840,23 @@ int BinaryOp_vulkan::forward(const std::vector<VkImageMat>& bottom_blobs, std::v
constants[17].i = 0; //top_blob.cstep;
const Pipeline* pipeline = 0;
if (B.elempack == 1 && out_elempack == 1)
if (B2.elempack == 1 && top_blob.elempack == 1)
{
pipeline = pipeline_binaryop_broadcast[0];
}
if (B.elempack == 4 && out_elempack == 4)
if (B2.elempack == 4 && top_blob.elempack == 4)
{
pipeline = pipeline_binaryop_broadcast_pack4[0];
}
if (B.elempack == 1 && out_elempack == 4)
if (B2.elempack == 1 && top_blob.elempack == 4)
{
pipeline = pipeline_binaryop_broadcast_pack1to4[0];
}
if (B.elempack == 8 && out_elempack == 8)
if (B2.elempack == 8 && top_blob.elempack == 8)
{
pipeline = pipeline_binaryop_broadcast_pack8[0];
}
if (B.elempack == 1 && out_elempack == 8)
if (B2.elempack == 1 && top_blob.elempack == 8)
{
pipeline = pipeline_binaryop_broadcast_pack1to8[0];
}
......
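Note on the BinaryOp Vulkan hunks above: the new `A2`/`B2` paths handle the case where a 1-D operand is packed but its element count does not match the other blob's outer axis (h for 2-D, c for 3-D/4-D), meaning it must broadcast along the inner width axis instead; repacking it to elempack=1 with `convert_packing` keeps per-element indexing consistent in the broadcast shaders. The sketch below restates the shape test in plain scalar form; `BlobDesc` and `needs_pack1_for_broadcast` are hypothetical names, not ncnn API.

```cpp
// Sketch (not ncnn code): decide whether a 1-D blob b must be repacked to
// elempack=1 before broadcasting against a higher-rank blob a.
// Assumes the same dims/w/h/c/elempack convention as ncnn::Mat.
struct BlobDesc { int dims, w, h, c, elempack; };

bool needs_pack1_for_broadcast(const BlobDesc& a, const BlobDesc& b)
{
    if (b.dims != 1) return false;

    const int b_total = b.w * b.elempack;
    if (a.dims == 2)
        return b_total != a.h * a.elempack;   // not a per-row vector -> repack
    if (a.dims == 3 || a.dims == 4)
        return b_total != a.c * a.elempack;   // not a per-channel vector -> repack
    return false;
}
```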
......@@ -972,13 +972,46 @@ int BinaryOp_x86::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>
{
// expand inner axes
if (outdims == 2)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.h * B.elempack)
A2 = A.reshape(1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 2;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 1)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 3;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 3 && A.dims == 2)
A2 = A.reshape(1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 1)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
{
if (A.w * A.elempack == B.c * B.elempack)
A2 = A.reshape(1, 1, 1, A.w, opt.workspace_allocator);
else // if (A.w == B.w)
{
A2.dims = 4;
A2.w = A.w * A.elempack;
A2.elempack = 1;
A2.elemsize = A.elemsize / A.elempack;
A2.cstep = A2.w;
}
}
if (outdims == 4 && A.dims == 2)
A2 = A.reshape(1, 1, A.w, A.h, opt.workspace_allocator);
if (outdims == 4 && A.dims == 3)
......@@ -988,13 +1021,46 @@ int BinaryOp_x86::forward(const std::vector<Mat>& bottom_blobs, std::vector<Mat>
{
// expand inner axes
if (outdims == 2)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.h * A.elempack)
B2 = B.reshape(1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 2;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 1)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 3;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 3 && B.dims == 2)
B2 = B.reshape(1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 1)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
{
if (B.w * B.elempack == A.c * A.elempack)
B2 = B.reshape(1, 1, 1, B.w, opt.workspace_allocator);
else // if (B.w == A.w)
{
B2.dims = 4;
B2.w = B.w * B.elempack;
B2.elempack = 1;
B2.elemsize = B.elemsize / B.elempack;
B2.cstep = B2.w;
}
}
if (outdims == 4 && B.dims == 2)
B2 = B.reshape(1, 1, B.w, B.h, opt.workspace_allocator);
if (outdims == 4 && B.dims == 3)
......
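The x86 hunks above take the complementary route: when the 1-D blob's length matches the other blob's innermost width rather than its height/channel count, reshaping it in its packed layout would misplace elements, so the code builds a flat elempack=1 view (w * elempack scalars) over the same storage. A minimal illustration with hypothetical values, not ncnn code:

```cpp
// A 1-D blob of 8 floats stored with elempack=4 is physically w=2 packed
// elements. Viewing it as w=8, elempack=1 reinterprets the same bytes so that
// each scalar lines up with one output column during inner-axis broadcasting.
#include <cstdio>

int main()
{
    const float data[8] = {0, 1, 2, 3, 4, 5, 6, 7};

    int w = 2, elempack = 4;   // packed view: 2 elements of 4 lanes each
    int flat_w = w * elempack; // flat view: 8 scalars, elempack = 1

    for (int i = 0; i < flat_w; i++)
        printf("output column %d gets %.0f\n", i, data[i]);
    return 0;
}
```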
......@@ -264,11 +264,11 @@ static void resize_bicubic_image(const Mat& src, Mat& dst, float* alpha, int* xo
__m256 _rows1 = _mm256_loadu_ps(rows1p);
__m256 _rows2 = _mm256_loadu_ps(rows2p);
__m256 _rows3 = _mm256_loadu_ps(rows3p);
__m256 _D = _mm256_mul_ps(_rows0, _b0_256);
_D = _mm256_comp_fmadd_ps(_rows1, _b1_256, _D);
_D = _mm256_comp_fmadd_ps(_rows2, _b2_256, _D);
_D = _mm256_comp_fmadd_ps(_rows3, _b3_256, _D);
_mm256_storeu_ps(Dp, _D);
__m256 _Dp = _mm256_mul_ps(_rows0, _b0_256);
_Dp = _mm256_comp_fmadd_ps(_rows1, _b1_256, _Dp);
_Dp = _mm256_comp_fmadd_ps(_rows2, _b2_256, _Dp);
_Dp = _mm256_comp_fmadd_ps(_rows3, _b3_256, _Dp);
_mm256_storeu_ps(Dp, _Dp);
Dp += 8;
rows0p += 8;
......@@ -287,11 +287,11 @@ static void resize_bicubic_image(const Mat& src, Mat& dst, float* alpha, int* xo
__m128 _rows1 = _mm_loadu_ps(rows1p);
__m128 _rows2 = _mm_loadu_ps(rows2p);
__m128 _rows3 = _mm_loadu_ps(rows3p);
__m128 _D = _mm_mul_ps(_rows0, _b0_128);
_D = _mm_comp_fmadd_ps(_rows1, _b1_128, _D);
_D = _mm_comp_fmadd_ps(_rows2, _b2_128, _D);
_D = _mm_comp_fmadd_ps(_rows3, _b3_128, _D);
_mm_storeu_ps(Dp, _D);
__m128 _Dp = _mm_mul_ps(_rows0, _b0_128);
_Dp = _mm_comp_fmadd_ps(_rows1, _b1_128, _Dp);
_Dp = _mm_comp_fmadd_ps(_rows2, _b2_128, _Dp);
_Dp = _mm_comp_fmadd_ps(_rows3, _b3_128, _Dp);
_mm_storeu_ps(Dp, _Dp);
Dp += 4;
rows0p += 4;
......
......@@ -268,11 +268,11 @@ static void resize_bicubic_image_pack16(const Mat& src, Mat& dst, float* alpha,
__m512 _rows1 = _mm512_load_ps(rows1p);
__m512 _rows2 = _mm512_load_ps(rows2p);
__m512 _rows3 = _mm512_load_ps(rows3p);
__m512 _D = _mm512_mul_ps(_rows0, _b0);
_D = _mm512_fmadd_ps(_rows1, _b1, _D);
_D = _mm512_fmadd_ps(_rows2, _b2, _D);
_D = _mm512_fmadd_ps(_rows3, _b3, _D);
_mm512_store_ps(Dp, _D);
__m512 _Dp = _mm512_mul_ps(_rows0, _b0);
_Dp = _mm512_fmadd_ps(_rows1, _b1, _Dp);
_Dp = _mm512_fmadd_ps(_rows2, _b2, _Dp);
_Dp = _mm512_fmadd_ps(_rows3, _b3, _Dp);
_mm512_store_ps(Dp, _Dp);
Dp += 16;
rows0p += 16;
......
......@@ -268,11 +268,11 @@ static void resize_bicubic_image_pack4(const Mat& src, Mat& dst, float* alpha, i
__m128 _rows1 = _mm_load_ps(rows1p);
__m128 _rows2 = _mm_load_ps(rows2p);
__m128 _rows3 = _mm_load_ps(rows3p);
__m128 _D = _mm_mul_ps(_rows0, _b0);
_D = _mm_comp_fmadd_ps(_rows1, _b1, _D);
_D = _mm_comp_fmadd_ps(_rows2, _b2, _D);
_D = _mm_comp_fmadd_ps(_rows3, _b3, _D);
_mm_store_ps(Dp, _D);
__m128 _Dp = _mm_mul_ps(_rows0, _b0);
_Dp = _mm_comp_fmadd_ps(_rows1, _b1, _Dp);
_Dp = _mm_comp_fmadd_ps(_rows2, _b2, _Dp);
_Dp = _mm_comp_fmadd_ps(_rows3, _b3, _Dp);
_mm_store_ps(Dp, _Dp);
Dp += 4;
rows0p += 4;
......
......@@ -268,11 +268,11 @@ static void resize_bicubic_image_pack8(const Mat& src, Mat& dst, float* alpha, i
__m256 _rows1 = _mm256_load_ps(rows1p);
__m256 _rows2 = _mm256_load_ps(rows2p);
__m256 _rows3 = _mm256_load_ps(rows3p);
__m256 _D = _mm256_mul_ps(_rows0, _b0);
_D = _mm256_comp_fmadd_ps(_rows1, _b1, _D);
_D = _mm256_comp_fmadd_ps(_rows2, _b2, _D);
_D = _mm256_comp_fmadd_ps(_rows3, _b3, _D);
_mm256_store_ps(Dp, _D);
__m256 _Dp = _mm256_mul_ps(_rows0, _b0);
_Dp = _mm256_comp_fmadd_ps(_rows1, _b1, _Dp);
_Dp = _mm256_comp_fmadd_ps(_rows2, _b2, _Dp);
_Dp = _mm256_comp_fmadd_ps(_rows3, _b3, _Dp);
_mm256_store_ps(Dp, _Dp);
Dp += 8;
rows0p += 8;
......
......@@ -137,9 +137,9 @@ static void resize_bilinear_image(const Mat& src, Mat& dst, float* alpha, int* x
{
__m256 _rows0 = _mm256_loadu_ps(rows0p);
__m256 _rows1 = _mm256_loadu_ps(rows1p);
__m256 _D = _mm256_mul_ps(_rows0, _b0_256);
_D = _mm256_comp_fmadd_ps(_rows1, _b1_256, _D);
_mm256_storeu_ps(Dp, _D);
__m256 _Dp = _mm256_mul_ps(_rows0, _b0_256);
_Dp = _mm256_comp_fmadd_ps(_rows1, _b1_256, _Dp);
_mm256_storeu_ps(Dp, _Dp);
Dp += 8;
rows0p += 8;
......@@ -152,9 +152,9 @@ static void resize_bilinear_image(const Mat& src, Mat& dst, float* alpha, int* x
{
__m128 _rows0 = _mm_loadu_ps(rows0p);
__m128 _rows1 = _mm_loadu_ps(rows1p);
__m128 _D = _mm_mul_ps(_rows0, _b0_128);
_D = _mm_comp_fmadd_ps(_rows1, _b1_128, _D);
_mm_storeu_ps(Dp, _D);
__m128 _Dp = _mm_mul_ps(_rows0, _b0_128);
_Dp = _mm_comp_fmadd_ps(_rows1, _b1_128, _Dp);
_mm_storeu_ps(Dp, _Dp);
Dp += 4;
rows0p += 4;
......
......@@ -109,9 +109,9 @@ static void resize_bilinear_image_pack16(const Mat& src, Mat& dst, float* alpha,
{
__m512 _rows0 = _mm512_load_ps(rows0p);
__m512 _rows1 = _mm512_load_ps(rows1p);
__m512 _D = _mm512_mul_ps(_rows0, _b0);
_D = _mm512_fmadd_ps(_rows1, _b1, _D);
_mm512_store_ps(Dp, _D);
__m512 _Dp = _mm512_mul_ps(_rows0, _b0);
_Dp = _mm512_fmadd_ps(_rows1, _b1, _Dp);
_mm512_store_ps(Dp, _Dp);
Dp += 16;
rows0p += 16;
......
......@@ -109,9 +109,9 @@ static void resize_bilinear_image_pack4(const Mat& src, Mat& dst, float* alpha,
{
__m128 _rows0 = _mm_load_ps(rows0p);
__m128 _rows1 = _mm_load_ps(rows1p);
__m128 _D = _mm_mul_ps(_rows0, _b0);
_D = _mm_comp_fmadd_ps(_rows1, _b1, _D);
_mm_store_ps(Dp, _D);
__m128 _Dp = _mm_mul_ps(_rows0, _b0);
_Dp = _mm_comp_fmadd_ps(_rows1, _b1, _Dp);
_mm_store_ps(Dp, _Dp);
Dp += 4;
rows0p += 4;
......
......@@ -109,9 +109,9 @@ static void resize_bilinear_image_pack8(const Mat& src, Mat& dst, float* alpha,
{
__m256 _rows0 = _mm256_load_ps(rows0p);
__m256 _rows1 = _mm256_load_ps(rows1p);
__m256 _D = _mm256_mul_ps(_rows0, _b0);
_D = _mm256_comp_fmadd_ps(_rows1, _b1, _D);
_mm256_store_ps(Dp, _D);
__m256 _Dp = _mm256_mul_ps(_rows0, _b0);
_Dp = _mm256_comp_fmadd_ps(_rows1, _b1, _Dp);
_mm256_store_ps(Dp, _Dp);
Dp += 8;
rows0p += 8;
......
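The resize hunks above only rename the local SIMD accumulator from `_D` to `_Dp` (presumably to avoid an identifier clash); the arithmetic is unchanged. For reference, the vertical pass of the separable resize blends pre-interpolated rows with fixed weights, using ncnn's `_comp_fmadd` helpers that fall back to mul+add when FMA is unavailable. A scalar sketch of the same computation, with hypothetical helper names:

```cpp
// Scalar form of what the SIMD hunks compute per output element (a sketch,
// not the ncnn implementation).
inline float blend_bilinear(float r0, float r1, float b0, float b1)
{
    return r0 * b0 + r1 * b1;                       // _Dp = rows0*b0 + rows1*b1
}

inline float blend_bicubic(float r0, float r1, float r2, float r3,
                           float b0, float b1, float b2, float b3)
{
    return r0 * b0 + r1 * b1 + r2 * b2 + r3 * b3;   // four-row cubic blend
}
```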
......@@ -474,24 +474,24 @@ static int lstm(const Mat& bottom_blob, Mat& top_blob, int reverse, const Mat& w
_MM_TRANSPOSE4_PS(_IFOG_4x4_0, _IFOG_4x4_1, _IFOG_4x4_2, _IFOG_4x4_3);
__m128 _I = sigmoid_sse(_IFOG_4x4_0);
__m128 _F = sigmoid_sse(_IFOG_4x4_1);
__m128 _O = sigmoid_sse(_IFOG_4x4_2);
__m128 _G = tanh_sse(_IFOG_4x4_3);
__m128 _lstm_I = sigmoid_sse(_IFOG_4x4_0);
__m128 _lstm_F = sigmoid_sse(_IFOG_4x4_1);
__m128 _lstm_O = sigmoid_sse(_IFOG_4x4_2);
__m128 _lstm_G = tanh_sse(_IFOG_4x4_3);
__m128 _cell2 = _mm_add_ps(_mm_mul_ps(_F, _mm_loadu_ps(cell_ptr + q)), _mm_mul_ps(_I, _G));
__m128 _H = _mm_mul_ps(_O, tanh_sse(_cell2));
__m128 _cell2 = _mm_add_ps(_mm_mul_ps(_lstm_F, _mm_loadu_ps(cell_ptr + q)), _mm_mul_ps(_lstm_I, _lstm_G));
__m128 _lstm_H = _mm_mul_ps(_lstm_O, tanh_sse(_cell2));
_mm_storeu_ps(cell_ptr + q, _cell2);
if (num_output == hidden_size)
{
_mm_storeu_ps(hidden_ptr + q, _H);
_mm_storeu_ps(output_data + q, _H);
_mm_storeu_ps(hidden_ptr + q, _lstm_H);
_mm_storeu_ps(output_data + q, _lstm_H);
}
else
{
_mm_storeu_ps(tmp_hidden_ptr + q, _H);
_mm_storeu_ps(tmp_hidden_ptr + q, _lstm_H);
}
}
#else // __SSE2__
......
......@@ -229,9 +229,9 @@ void resize_bilinear_c1(const unsigned char* src, int srcw, int srch, int srcstr
int16x4_t _acc16 = vshrn_n_s32(_acc, 2);
int16x4_t _acc16_1 = vshrn_n_s32(_acc_1, 2);
uint8x8_t _D = vqmovun_s16(vcombine_s16(_acc16, _acc16_1));
uint8x8_t _Dp = vqmovun_s16(vcombine_s16(_acc16, _acc16_1));
vst1_u8(Dp, _D);
vst1_u8(Dp, _Dp);
Dp += 8;
rows0p += 8;
......@@ -538,9 +538,9 @@ void resize_bilinear_c2(const unsigned char* src, int srcw, int srch, int srcstr
int16x4_t _acc16 = vshrn_n_s32(_acc, 2);
int16x4_t _acc16_1 = vshrn_n_s32(_acc_1, 2);
uint8x8_t _D = vqmovun_s16(vcombine_s16(_acc16, _acc16_1));
uint8x8_t _Dp = vqmovun_s16(vcombine_s16(_acc16, _acc16_1));
vst1_u8(Dp, _D);
vst1_u8(Dp, _Dp);
Dp += 8;
rows0p += 8;
......@@ -858,9 +858,9 @@ void resize_bilinear_c3(const unsigned char* src, int srcw, int srch, int srcstr
int16x4_t _acc16 = vshrn_n_s32(_acc, 2);
int16x4_t _acc16_1 = vshrn_n_s32(_acc_1, 2);
uint8x8_t _D = vqmovun_s16(vcombine_s16(_acc16, _acc16_1));
uint8x8_t _Dp = vqmovun_s16(vcombine_s16(_acc16, _acc16_1));
vst1_u8(Dp, _D);
vst1_u8(Dp, _Dp);
Dp += 8;
rows0p += 8;
......@@ -1158,9 +1158,9 @@ void resize_bilinear_c4(const unsigned char* src, int srcw, int srch, int srcstr
int16x4_t _acc16 = vshrn_n_s32(_acc, 2);
int16x4_t _acc16_1 = vshrn_n_s32(_acc_1, 2);
uint8x8_t _D = vqmovun_s16(vcombine_s16(_acc16, _acc16_1));
uint8x8_t _Dp = vqmovun_s16(vcombine_s16(_acc16, _acc16_1));
vst1_u8(Dp, _D);
vst1_u8(Dp, _Dp);
Dp += 8;
rows0p += 8;
......
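The NEON hunks above are the same `_D` to `_Dp` rename in the fixed-point uint8 bilinear path: two row accumulators are narrowed with a right shift by 2 and saturated to unsigned 8-bit before the store. A scalar sketch of that narrowing step (hypothetical helper name, not ncnn code):

```cpp
// Scalar equivalent of vshrn_n_s32(_acc, 2) followed by vqmovun_s16.
#include <algorithm>
#include <cstdint>

inline uint8_t narrow_saturate_u8(int32_t acc)
{
    int16_t narrowed = (int16_t)(acc >> 2);                     // shift + narrow
    return (uint8_t)std::min(std::max((int)narrowed, 0), 255);  // saturate to u8
}
```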
......@@ -329,6 +329,55 @@ static int test_binaryop_5()
return 0;
}
static int test_binaryop_6()
{
const int ws[] = {16, 12, 16, 15};
const int hs[] = {15, 16, 15, 12};
const int ds[] = {12, 14, 12, 16};
const int cs[] = {31, 28, 24, 32};
for (int i = 0; i < 4; i++)
{
const int w = ws[i];
const int h = hs[i];
const int d = ds[i];
const int c = cs[i];
const int flag = c == 32 ? TEST_LAYER_DISABLE_GPU_TESTING : 0;
ncnn::Mat a[3] = {
RandomMat(d, c),
RandomMat(h, d, c),
RandomMat(w, h, d, c),
};
for (int j = 0; j < 3; j++)
{
ncnn::Mat b = RandomMat(a[j].w);
int ret = test_binaryop(a[j], b, flag) || test_binaryop(b, a[j], flag);
if (ret != 0)
return ret;
}
ncnn::Mat aa[3] = {
RandomMat(c, c),
RandomMat(c, d, c),
RandomMat(c, h, d, c),
};
for (int j = 0; j < 3; j++)
{
ncnn::Mat b = RandomMat(aa[j].w);
int ret = test_binaryop(aa[j], b, flag) || test_binaryop(b, aa[j], flag);
if (ret != 0)
return ret;
}
}
return 0;
}
int main()
{
SRAND(7767517);
......@@ -340,7 +389,8 @@ int main()
|| test_binaryop_2()
|| test_binaryop_3()
|| test_binaryop_4()
|| test_binaryop_5();
|| test_binaryop_5()
|| test_binaryop_6();
if (ret != 0)
return ret;
......
......@@ -329,6 +329,55 @@ static int test_binaryop_5()
return 0;
}
static int test_binaryop_6()
{
const int ws[] = {16, 12, 16, 15};
const int hs[] = {15, 16, 15, 12};
const int ds[] = {12, 14, 12, 16};
const int cs[] = {31, 28, 24, 32};
for (int i = 0; i < 4; i++)
{
const int w = ws[i];
const int h = hs[i];
const int d = ds[i];
const int c = cs[i];
const int flag = c == 32 ? TEST_LAYER_DISABLE_GPU_TESTING : 0;
ncnn::Mat a[3] = {
RandomMat(d, c),
RandomMat(h, d, c),
RandomMat(w, h, d, c),
};
for (int j = 0; j < 3; j++)
{
ncnn::Mat b = RandomMat(a[j].w);
int ret = test_binaryop(a[j], b, flag) || test_binaryop(b, a[j], flag);
if (ret != 0)
return ret;
}
ncnn::Mat aa[3] = {
RandomMat(c, c),
RandomMat(c, d, c),
RandomMat(c, h, d, c),
};
for (int j = 0; j < 3; j++)
{
ncnn::Mat b = RandomMat(aa[j].w);
int ret = test_binaryop(aa[j], b, flag) || test_binaryop(b, aa[j], flag);
if (ret != 0)
return ret;
}
}
return 0;
}
int main()
{
SRAND(7767517);
......@@ -340,7 +389,8 @@ int main()
|| test_binaryop_2()
|| test_binaryop_3()
|| test_binaryop_4()
|| test_binaryop_5();
|| test_binaryop_5()
|| test_binaryop_6();
if (ret != 0)
return ret;
......
......@@ -329,6 +329,55 @@ static int test_binaryop_5()
return 0;
}
static int test_binaryop_6()
{
const int ws[] = {16, 12, 16, 15};
const int hs[] = {15, 16, 15, 12};
const int ds[] = {12, 14, 12, 16};
const int cs[] = {31, 28, 24, 32};
for (int i = 0; i < 4; i++)
{
const int w = ws[i];
const int h = hs[i];
const int d = ds[i];
const int c = cs[i];
const int flag = c == 32 ? TEST_LAYER_DISABLE_GPU_TESTING : 0;
ncnn::Mat a[3] = {
RandomMat(d, c),
RandomMat(h, d, c),
RandomMat(w, h, d, c),
};
for (int j = 0; j < 3; j++)
{
ncnn::Mat b = RandomMat(a[j].w);
int ret = test_binaryop(a[j], b, flag) || test_binaryop(b, a[j], flag);
if (ret != 0)
return ret;
}
ncnn::Mat aa[3] = {
RandomMat(c, c),
RandomMat(c, d, c),
RandomMat(c, h, d, c),
};
for (int j = 0; j < 3; j++)
{
ncnn::Mat b = RandomMat(aa[j].w);
int ret = test_binaryop(aa[j], b, flag) || test_binaryop(b, aa[j], flag);
if (ret != 0)
return ret;
}
}
return 0;
}
int main()
{
SRAND(7767517);
......@@ -340,7 +389,8 @@ int main()
|| test_binaryop_2()
|| test_binaryop_3()
|| test_binaryop_4()
|| test_binaryop_5();
|| test_binaryop_5()
|| test_binaryop_6();
if (ret != 0)
return ret;
......
......@@ -329,6 +329,55 @@ static int test_binaryop_5()
return 0;
}
static int test_binaryop_6()
{
const int ws[] = {16, 12, 16, 15};
const int hs[] = {15, 16, 15, 12};
const int ds[] = {12, 14, 12, 16};
const int cs[] = {31, 28, 24, 32};
for (int i = 0; i < 4; i++)
{
const int w = ws[i];
const int h = hs[i];
const int d = ds[i];
const int c = cs[i];
const int flag = c == 32 ? TEST_LAYER_DISABLE_GPU_TESTING : 0;
ncnn::Mat a[3] = {
RandomMat(d, c),
RandomMat(h, d, c),
RandomMat(w, h, d, c),
};
for (int j = 0; j < 3; j++)
{
ncnn::Mat b = RandomMat(a[j].w);
int ret = test_binaryop(a[j], b, flag) || test_binaryop(b, a[j], flag);
if (ret != 0)
return ret;
}
ncnn::Mat aa[3] = {
RandomMat(c, c),
RandomMat(c, d, c),
RandomMat(c, h, d, c),
};
for (int j = 0; j < 3; j++)
{
ncnn::Mat b = RandomMat(aa[j].w);
int ret = test_binaryop(aa[j], b, flag) || test_binaryop(b, aa[j], flag);
if (ret != 0)
return ret;
}
}
return 0;
}
int main()
{
SRAND(7767517);
......@@ -340,7 +389,8 @@ int main()
|| test_binaryop_2()
|| test_binaryop_3()
|| test_binaryop_4()
|| test_binaryop_5();
|| test_binaryop_5()
|| test_binaryop_6();
if (ret != 0)
return ret;
......
The test added four times above (once per binaryop test file) exercises exactly this ambiguity: in the first loop `b` has as many elements as the other blob's innermost width, while in the second loop every dimension equals `c`, so the same length also matches the channel count and both broadcast interpretations are reachable.
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR powerpc64le)
set(CMAKE_C_COMPILER "clang")
set(CMAKE_CXX_COMPILER "clang++")
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_C_FLAGS "-target powerpc64le-linux-gnu -I/usr/powerpc64le-linux-gnu/include -mcpu=power8 -mtune=power8 -DNO_WARN_X86_INTRINSICS -D__MMX__ -D__SSE__ -D__SSSE3__")
set(CMAKE_CXX_FLAGS "-target powerpc64le-linux-gnu -I/usr/powerpc64le-linux-gnu/include -I/usr/powerpc64le-linux-gnu/include/c++/10/powerpc64le-linux-gnu -mcpu=power8 -mtune=power8 -DNO_WARN_X86_INTRINSICS -D__MMX__ -D__SSE__ -D__SSSE3__")
# cache flags
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS}" CACHE STRING "c flags")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}" CACHE STRING "c++ flags")
# Auto-translate SSE to VSX
set(NCNN_PPC64LE_VSX ON)
set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR powerpc64le)
set(CMAKE_C_COMPILER "powerpc64le-linux-gnu-gcc")
set(CMAKE_CXX_COMPILER "powerpc64le-linux-gnu-g++")
set(CMAKE_FIND_ROOT_PATH_MODE_PROGRAM NEVER)
set(CMAKE_FIND_ROOT_PATH_MODE_LIBRARY ONLY)
set(CMAKE_FIND_ROOT_PATH_MODE_INCLUDE ONLY)
set(CMAKE_C_FLAGS "-mcpu=power8 -mtune=power8 -DNO_WARN_X86_INTRINSICS -D__MMX__ -D__SSE__ -D__SSSE3__")
set(CMAKE_CXX_FLAGS "-mcpu=power8 -mtune=power8 -DNO_WARN_X86_INTRINSICS -D__MMX__ -D__SSE__ -D__SSSE3__")
# cache flags
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS}" CACHE STRING "c flags")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS}" CACHE STRING "c++ flags")
# Auto-translate SSE to VSX
set(NCNN_PPC64LE_VSX ON)
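Both POWER8 toolchain files define `NO_WARN_X86_INTRINSICS` together with `__MMX__`/`__SSE__`/`__SSSE3__` so that ncnn's existing x86 SIMD paths compile through the SSE-to-VSX compatibility headers shipped with powerpc64le GCC/Clang. A hedged sketch of what that enables, assuming such a toolchain is in use:

```cpp
// Ordinary SSE code compiles unchanged on powerpc64le when the x86 intrinsic
// headers are provided as VSX wrappers (NO_WARN_X86_INTRINSICS silences the
// compatibility warning).
#include <emmintrin.h>  // on POWER, a compatibility wrapper over VSX

int main()
{
    __m128 a = _mm_set1_ps(1.5f);
    __m128 b = _mm_set1_ps(2.0f);
    __m128 c = _mm_add_ps(a, b);    // lowered to a VSX vector add

    float out[4];
    _mm_storeu_ps(out, c);
    return out[0] == 3.5f ? 0 : 1;
}
```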
......@@ -2,6 +2,10 @@
find_package(Protobuf)
if(PROTOBUF_FOUND)
if(DEFINED Protobuf_VERSION AND Protobuf_VERSION VERSION_GREATER_EQUAL 3.22)
set(CMAKE_CXX_STANDARD 17)
endif()
protobuf_generate_cpp(CAFFE_PROTO_SRCS CAFFE_PROTO_HDRS caffe.proto)
add_executable(caffe2ncnn caffe2ncnn.cpp ${CAFFE_PROTO_SRCS} ${CAFFE_PROTO_HDRS})
target_include_directories(caffe2ncnn
......
......@@ -2,6 +2,10 @@
find_package(Protobuf)
if(PROTOBUF_FOUND)
if(DEFINED Protobuf_VERSION AND Protobuf_VERSION VERSION_GREATER_EQUAL 3.22)
set(CMAKE_CXX_STANDARD 17)
endif()
protobuf_generate_cpp(ONNX_PROTO_SRCS ONNX_PROTO_HDRS onnx.proto)
add_executable(onnx2ncnn onnx2ncnn.cpp ${ONNX_PROTO_SRCS} ${ONNX_PROTO_HDRS})
target_include_directories(onnx2ncnn
......
......@@ -520,8 +520,8 @@ TORCH_LIBRARY(upfirdn2d_op, m) {
|nn.LeakyReLU | :heavy_check_mark: | :heavy_check_mark: |
|nn.Linear | :heavy_check_mark: | :heavy_check_mark: |
|nn.LocalResponseNorm | :heavy_check_mark: | :heavy_check_mark: |
|nn.LogSigmoid | :heavy_check_mark: |
|nn.LogSoftmax | :heavy_check_mark: |
|nn.LogSigmoid | :heavy_check_mark: | :heavy_check_mark: |
|nn.LogSoftmax | :heavy_check_mark: | :heavy_check_mark: |
|nn.LPPool1d | :heavy_check_mark: |
|nn.LPPool2d | :heavy_check_mark: |
|nn.LSTM | :heavy_check_mark: | :heavy_check_mark: |
......@@ -626,8 +626,8 @@ TORCH_LIBRARY(upfirdn2d_op, m) {
|F.leaky_relu_ | :heavy_check_mark: | :heavy_check_mark: |
|F.linear | :heavy_check_mark: | :heavy_check_mark:* |
|F.local_response_norm | :heavy_check_mark: | :heavy_check_mark: |
|F.logsigmoid | :heavy_check_mark: |
|F.log_softmax | :heavy_check_mark: |
|F.logsigmoid | :heavy_check_mark: | :heavy_check_mark: |
|F.log_softmax | :heavy_check_mark: | :heavy_check_mark: |
|F.lp_pool1d | :heavy_check_mark: |
|F.lp_pool2d | :heavy_check_mark: |
|F.max_pool1d | :heavy_check_mark: | :heavy_check_mark: |
......
......@@ -428,6 +428,8 @@ set(pnnx_pass_ncnn_SRCS
pass_ncnn/F_leaky_relu.cpp
pass_ncnn/F_linear.cpp
pass_ncnn/F_local_response_norm.cpp
pass_ncnn/F_log_softmax.cpp
pass_ncnn/F_logsigmoid.cpp
pass_ncnn/F_max_pool1d.cpp
pass_ncnn/F_max_pool2d.cpp
pass_ncnn/F_max_pool3d.cpp
......@@ -485,6 +487,8 @@ set(pnnx_pass_ncnn_SRCS
pass_ncnn/nn_LeakyReLU.cpp
pass_ncnn/nn_Linear.cpp
pass_ncnn/nn_LocalResponseNorm.cpp
pass_ncnn/nn_LogSigmoid.cpp
pass_ncnn/nn_LogSoftmax.cpp
pass_ncnn/nn_LSTM.cpp
pass_ncnn/nn_MaxPool1d.cpp
pass_ncnn/nn_MaxPool2d.cpp
......@@ -537,6 +541,7 @@ set(pnnx_pass_ncnn_SRCS
pass_ncnn/torch_prod.cpp
pass_ncnn/torch_squeeze.cpp
pass_ncnn/torch_sum.cpp
pass_ncnn/torch_t.cpp
pass_ncnn/torch_transpose.cpp
pass_ncnn/torch_unsqueeze.cpp
pass_ncnn/torchvision_DeformConv2d.cpp
......
......@@ -1298,11 +1298,19 @@ static std::string expand_expression(const Operator* op)
}
else if (t == "atan2"
|| t == "fmod"
|| t == "max"
|| t == "maximum"
|| t == "min"
|| t == "minimum"
|| t == "pow")
{
std::string binaryop;
if (t == "atan2") binaryop = "torch.atan2";
if (t == "fmod") binaryop = "torch.fmod";
if (t == "max") binaryop = "torch.max";
if (t == "maximum") binaryop = "torch.maximum";
if (t == "min") binaryop = "torch.min";
if (t == "minimum") binaryop = "torch.minimum";
if (t == "pow") binaryop = "torch.pow";
std::string a = exprstack.top();
......@@ -1313,7 +1321,17 @@ static std::string expand_expression(const Operator* op)
std::string r = binaryop + "(" + a + ", " + b + ")";
exprstack.push(r);
}
else if (t == "add" || t == "sub" || t == "mul" || t == "div" || t == "floor_divide" || t == "remainder" || t == "and" || t == "or" || t == "xor" || t == "lshift" || t == "rshift")
else if (t == "add"
|| t == "sub"
|| t == "mul"
|| t == "div"
|| t == "floor_divide"
|| t == "remainder"
|| t == "and"
|| t == "or"
|| t == "xor"
|| t == "lshift"
|| t == "rshift")
{
std::string binaryop;
if (t == "add") binaryop = "+";
......
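The expression expander above now recognizes `max`/`maximum`/`min`/`minimum` tokens and turns them into `torch.maximum`/`torch.minimum` style binary calls, matching the fuse-side changes further down. The toy stack machine below illustrates the general pop-two-push-one expansion; it is a self-contained sketch, not pnnx code.

```cpp
// Toy illustration of expanding reverse-polish expression tokens into binary
// op calls: operands are pushed, and a binary token such as "maximum" pops
// two operands and pushes the rebuilt call string.
#include <cstdio>
#include <stack>
#include <string>
#include <vector>

int main()
{
    // tokens for maximum(@0, add(@1, 2)), fed to the stack machine
    std::vector<std::string> tokens = {"2", "@1", "add", "@0", "maximum"};

    std::stack<std::string> exprstack;
    for (const std::string& t : tokens)
    {
        if (t == "maximum" || t == "minimum")
        {
            std::string a = exprstack.top(); exprstack.pop();
            std::string b = exprstack.top(); exprstack.pop();
            exprstack.push("torch." + t + "(" + a + ", " + b + ")");
        }
        else if (t == "add")
        {
            std::string a = exprstack.top(); exprstack.pop();
            std::string b = exprstack.top(); exprstack.pop();
            exprstack.push("(" + a + " + " + b + ")");
        }
        else
        {
            exprstack.push(t);  // operand or literal
        }
    }
    printf("%s\n", exprstack.top().c_str());  // torch.maximum(@0, (@1 + 2))
    return 0;
}
```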
......@@ -29,7 +29,7 @@ void reset_device(std::shared_ptr<torch::jit::Graph>& graph, const std::string&
if (dtype_node->hasAttribute(torch::jit::attr::value))
{
// change dtype=half to dtype=float
if (dtype_node->i(torch::jit::attr::value) == 5)
if (dtype_node->kindOf(torch::jit::attr::value) == torch::jit::AttributeKind::i && dtype_node->i(torch::jit::attr::value) == 5)
{
dtype_node->i_(torch::jit::attr::value, 6);
}
......
......@@ -132,6 +132,8 @@ void pass_level1(const torch::jit::Module& mod, const std::shared_ptr<torch::jit
// sub_mod.dump(true, true, true);
op->attrs["data"] = sub_mod.attr(name).toTensor();
op->outputs[0]->type = op->attrs["data"].type;
op->outputs[0]->shape = op->attrs["data"].shape;
}
}
else if (n->kind() == c10::prim::Constant) // || n->kind() == c10::prim::ListConstruct)
......
......@@ -47,6 +47,11 @@ static bool operand_maybe_tensor(const Operand* operand)
return false;
}
if (op->type == "torch.unbind" && op->inputs[0]->shape.size() == 1)
{
return false;
}
if (op->type == "aten::size")
{
return false;
......@@ -101,6 +106,10 @@ static bool operand_maybe_tensor(const Operand* operand)
|| op->type == "aten::div"
|| op->type == "aten::floor_divide"
|| op->type == "aten::fmod"
|| op->type == "aten::max"
|| op->type == "aten::maximum"
|| op->type == "aten::min"
|| op->type == "aten::minimum"
|| op->type == "aten::mul"
|| op->type == "aten::pow"
|| op->type == "aten::remainder")
......@@ -131,25 +140,7 @@ static void fuse_expression(Graph& graph, Operand* operand, std::string& expr, s
{
if (op->outputs.size() > 1 || op->outputs[0]->consumers.size() > 1)
{
auto it = std::find(inputs.begin(), inputs.end(), operand);
if (it == inputs.end())
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)inputs.size());
expr += tmp;
inputs.push_back(operand);
}
else
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)(it - inputs.begin()));
expr += tmp;
}
return;
goto DEFAULT;
}
}
......@@ -189,24 +180,169 @@ static void fuse_expression(Graph& graph, Operand* operand, std::string& expr, s
}
else
{
auto it = std::find(inputs.begin(), inputs.end(), operand);
if (it == inputs.end())
goto DEFAULT;
}
}
else if (op->type == "pnnx.Attribute")
{
// fprintf(stderr, "operand pnnx.Attribute %s\n", operand->name.c_str());
const Attribute& data = op->attrs["data"];
if (data.shape.size() == 1 && data.shape[0] == 1 && data.type != -1)
{
if (data.type == 0)
{
expr += "None";
}
else if (data.type == 1)
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)inputs.size());
sprintf(tmp, "%e", ((const float*)data.data.data())[0]);
expr += tmp;
}
else if (data.type == 2)
{
char tmp[32];
sprintf(tmp, "%e", ((const double*)data.data.data())[0]);
expr += tmp;
}
else if (data.type == 4)
{
char tmp[32];
sprintf(tmp, "%d", ((const int*)data.data.data())[0]);
expr += tmp;
}
else if (data.type == 5)
{
int64_t v = ((const int64_t*)data.data.data())[0];
if (v == std::numeric_limits<int64_t>::max()) v = INT_MAX;
if (v == std::numeric_limits<int64_t>::min()) v = INT_MIN;
inputs.push_back(operand);
char tmp[32];
sprintf(tmp, "%d", (int)v);
expr += tmp;
}
else
else if (data.type == 6)
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)(it - inputs.begin()));
sprintf(tmp, "%d", ((const short*)data.data.data())[0]);
expr += tmp;
}
else if (data.type == 7)
{
char tmp[32];
sprintf(tmp, "%d", ((const signed char*)data.data.data())[0]);
expr += tmp;
}
else if (data.type == 8)
{
char tmp[32];
sprintf(tmp, "%u", ((const unsigned char*)data.data.data())[0]);
expr += tmp;
}
else if (data.type == 9)
{
expr += ((const char*)data.data.data())[0] ? "True" : "False";
}
else
{
// unsupported type
fprintf(stderr, "fuse expression got unsupported scalar type %d\n", data.type);
}
}
else
{
goto DEFAULT;
}
}
else if (op->type == "torch.unbind")
{
// track chain
// pnnx.Attribute/foldable with 1-rank
// torch.unbind to constant scalar
Operand* operand2 = op->inputs[0];
if (operand2->producer->type == "pnnx.Attribute")
{
const Attribute& data = operand2->producer->attrs["data"];
if (data.shape.size() == 1 && data.type != -1)
{
// resolve scalar i
int si = 0;
for (size_t i = 0; i < op->outputs.size(); i++)
{
if (op->outputs[i] == operand)
{
si = (int)i;
break;
}
}
if (data.type == 0)
{
expr += "None";
}
else if (data.type == 1)
{
char tmp[32];
sprintf(tmp, "%e", ((const float*)data.data.data())[si]);
expr += tmp;
}
else if (data.type == 2)
{
char tmp[32];
sprintf(tmp, "%e", ((const double*)data.data.data())[si]);
expr += tmp;
}
else if (data.type == 4)
{
char tmp[32];
sprintf(tmp, "%d", ((const int*)data.data.data())[si]);
expr += tmp;
}
else if (data.type == 5)
{
int64_t v = ((const int64_t*)data.data.data())[si];
if (v == std::numeric_limits<int64_t>::max()) v = INT_MAX;
if (v == std::numeric_limits<int64_t>::min()) v = INT_MIN;
char tmp[32];
sprintf(tmp, "%d", (int)v);
expr += tmp;
}
else if (data.type == 6)
{
char tmp[32];
sprintf(tmp, "%d", ((const short*)data.data.data())[si]);
expr += tmp;
}
else if (data.type == 7)
{
char tmp[32];
sprintf(tmp, "%d", ((const signed char*)data.data.data())[si]);
expr += tmp;
}
else if (data.type == 8)
{
char tmp[32];
sprintf(tmp, "%u", ((const unsigned char*)data.data.data())[si]);
expr += tmp;
}
else if (data.type == 9)
{
expr += ((const char*)data.data.data())[si] ? "True" : "False";
}
else
{
// unsupported type
fprintf(stderr, "fuse expression got unsupported scalar type %d\n", data.type);
goto DEFAULT;
}
return;
}
}
goto DEFAULT;
}
else if (checksubgraph && operand_maybe_tensor(operand) && foldable_constants.find(operand->name) != foldable_constants.end())
{
......@@ -251,6 +387,9 @@ static void fuse_expression(Graph& graph, Operand* operand, std::string& expr, s
int64_t v;
zip.read_file(operand->name, (char*)&v);
if (v == std::numeric_limits<int64_t>::max()) v = INT_MAX;
if (v == std::numeric_limits<int64_t>::min()) v = INT_MIN;
char tmp[32];
sprintf(tmp, "%ld", v);
expr += tmp;
......@@ -313,23 +452,7 @@ static void fuse_expression(Graph& graph, Operand* operand, std::string& expr, s
}
else
{
auto it = std::find(inputs.begin(), inputs.end(), operand);
if (it == inputs.end())
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)inputs.size());
expr += tmp;
inputs.push_back(operand);
}
else
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)(it - inputs.begin()));
expr += tmp;
}
goto DEFAULT;
}
}
else if (op->type == "prim::NumToTensor")
......@@ -373,23 +496,7 @@ static void fuse_expression(Graph& graph, Operand* operand, std::string& expr, s
}
else
{
auto it = std::find(inputs.begin(), inputs.end(), operand);
if (it == inputs.end())
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)inputs.size());
expr += tmp;
inputs.push_back(operand);
}
else
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)(it - inputs.begin()));
expr += tmp;
}
goto DEFAULT;
}
}
else if (op->type == "aten::detach" || op->type == "aten::ScalarImplicit")
......@@ -433,6 +540,10 @@ static void fuse_expression(Graph& graph, Operand* operand, std::string& expr, s
else if (op->type == "aten::atan2"
|| op->type == "aten::floor_divide"
|| op->type == "aten::fmod"
|| op->type == "aten::max"
|| op->type == "aten::maximum"
|| op->type == "aten::min"
|| op->type == "aten::minimum"
|| op->type == "aten::mul"
|| op->type == "aten::pow"
|| op->type == "aten::remainder")
......@@ -536,23 +647,28 @@ static void fuse_expression(Graph& graph, Operand* operand, std::string& expr, s
}
else
{
auto it = std::find(inputs.begin(), inputs.end(), operand);
if (it == inputs.end())
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)inputs.size());
expr += tmp;
goto DEFAULT;
}
inputs.push_back(operand);
}
else
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)(it - inputs.begin()));
expr += tmp;
}
return;
DEFAULT:
auto it = std::find(inputs.begin(), inputs.end(), operand);
if (it == inputs.end())
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)inputs.size());
expr += tmp;
inputs.push_back(operand);
}
else
{
// tensor
char tmp[32];
sprintf(tmp, "@%d", (int)(it - inputs.begin()));
expr += tmp;
}
}
......@@ -621,6 +737,10 @@ void fuse_expression(Graph& graph, const std::set<std::string>& foldable_constan
|| op->type == "aten::fmod"
|| op->type == "aten::log"
|| op->type == "aten::log10"
|| op->type == "aten::max"
|| op->type == "aten::maximum"
|| op->type == "aten::min"
|| op->type == "aten::minimum"
|| op->type == "aten::mul"
|| op->type == "aten::neg"
|| op->type == "aten::pow"
......
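A recurring detail in the fuse_expression hunks above is the INT64 clamp: torch graphs often encode "no limit" as the int64 max/min sentinel, and printing it through the narrow `%d` expression literal would wrap. Pinning to `INT_MAX`/`INT_MIN` first keeps the fused expression meaningful. A small illustration (the `main` below is only for demonstration):

```cpp
#include <climits>
#include <cstdint>
#include <cstdio>
#include <limits>

int main()
{
    int64_t v = std::numeric_limits<int64_t>::max();

    int wrapped = (int)v;   // implementation-defined narrowing, typically -1
    if (v == std::numeric_limits<int64_t>::max()) v = INT_MAX;
    if (v == std::numeric_limits<int64_t>::min()) v = INT_MIN;

    printf("wrapped=%d clamped=%d\n", wrapped, (int)v);
    return 0;
}
```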
......@@ -165,7 +165,10 @@ void eliminate_reshape_shape_expression(Graph& graph)
if (op_expr->outputs[0]->consumers.size() == 0)
{
// remove expression operator
op_expr->inputs[0]->remove_consumer(op_expr);
for (auto x : op_expr->inputs)
{
x->remove_consumer(op_expr);
}
Operand* op_expr_out = op_expr->outputs[0];
......
......@@ -193,6 +193,11 @@ static std::string eval_expression(const Operator* op)
if (t == "int")
{
int r = int(af);
if (token_is_interger_literal(a))
{
r = std::stoi(a);
}
exprstack.push(std::to_string(r));
}
if (t == "abs")
......@@ -339,6 +344,10 @@ static std::string eval_expression(const Operator* op)
else if (t == "atan2"
|| t == "add"
|| t == "sub"
|| t == "max"
|| t == "maximum"
|| t == "min"
|| t == "minimum"
|| t == "mul"
|| t == "div"
|| t == "floor_divide"
......@@ -371,6 +380,16 @@ static std::string eval_expression(const Operator* op)
float r = af - bf;
exprstack.push(std::to_string(r));
}
if (t == "max" || t == "maximum")
{
float r = std::max(af, bf);
exprstack.push(std::to_string(r));
}
if (t == "minimum")
{
float r = std::min(af, bf);
exprstack.push(std::to_string(r));
}
if (t == "mul")
{
float r = af * bf;
......
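The eval_expression hunks add constant folding for `max`/`maximum`/`minimum` and, more subtly, reparse integer-literal tokens with `std::stoi` inside `int(...)`: routing a large integer through the float path loses precision. An illustrative sketch of the failure mode the stoi branch avoids:

```cpp
#include <cstdio>
#include <string>

int main()
{
    std::string a = "16777217";     // 2^24 + 1, not representable in float

    float af = std::stof(a);
    int via_float = int(af);        // 16777216 -- off by one
    int via_stoi = std::stoi(a);    // 16777217 -- exact

    printf("float path: %d, stoi path: %d\n", via_float, via_stoi);
    return 0;
}
```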
// Tencent is pleased to support the open source community by making ncnn available.
//
// Copyright (C) 2021 THL A29 Limited, a Tencent company. All rights reserved.
//
// Licensed under the BSD 3-Clause License (the "License"); you may not use this file except
// in compliance with the License. You may obtain a copy of the License at
//
// https://opensource.org/licenses/BSD-3-Clause
//
// Unless required by applicable law or agreed to in writing, software distributed
// under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
// CONDITIONS OF ANY KIND, either express or implied. See the License for the
// specific language governing permissions and limitations under the License.
#include "pass_ncnn.h"
namespace pnnx {
namespace ncnn {
class F_log_softmax : public GraphRewriterPass
{
public:
const char* match_pattern_graph() const
{
return R"PNNXIR(7767517
3 2
pnnx.Input input 0 1 input
F.log_softmax op 1 1 input out dim=%dim
pnnx.Output output 1 0 out
)PNNXIR";
}
const char* replace_pattern_graph() const
{
return R"PNNXIR(7767517
4 3
pnnx.Input input 0 1 input
F.softmax softmax 1 1 input softmax
UnaryOp log 1 1 softmax out 0=8
pnnx.Output output 1 0 out
)PNNXIR";
}
const char* type_str() const
{
return "F_log_softmax";
}
const char* name_str() const
{
return "f_logsoftmax";
}
void write(const std::map<std::string, Operator*>& ops, const std::map<std::string, Parameter>& captured_params, const std::map<std::string, Attribute>& captured_attrs) const
{
GraphRewriterPass::write(ops, captured_params, captured_attrs);
ops.at("softmax")->params["dim"] = captured_params.at("dim");
}
};
REGISTER_GLOBAL_PNNX_NCNN_GRAPH_REWRITER_PASS(F_log_softmax, 19)
} // namespace ncnn
} // namespace pnnx
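The rewriter above decomposes F.log_softmax into an F.softmax followed by ncnn's UnaryOp with op type 8 (log); the F_logsigmoid pass that follows applies the same decomposition with sigmoid. The identity being relied on:

```latex
\[
\operatorname{log\_softmax}(x)_i
  = \log\!\left(\frac{e^{x_i}}{\sum_j e^{x_j}}\right)
  = \log\bigl(\operatorname{softmax}(x)_i\bigr),
\qquad
\operatorname{logsigmoid}(x) = \log\bigl(\sigma(x)\bigr).
\]
```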
// Tencent is pleased to support the open source community by making ncnn available.
//
// Copyright (C) 2021 THL A29 Limited, a Tencent company. All rights reserved.
//
// Licensed under the BSD 3-Clause License (the "License"); you may not use this file except
// in compliance with the License. You may obtain a copy of the License at
//
// https://opensource.org/licenses/BSD-3-Clause
//
// Unless required by applicable law or agreed to in writing, software distributed
// under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR
// CONDITIONS OF ANY KIND, either express or implied. See the License for the
// specific language governing permissions and limitations under the License.
#include "pass_ncnn.h"
namespace pnnx {
namespace ncnn {
class F_logsigmoid : public GraphRewriterPass
{
public:
const char* match_pattern_graph() const
{
return R"PNNXIR(7767517
3 2
pnnx.Input input 0 1 input
F.logsigmoid op 1 1 input out
pnnx.Output output 1 0 out
)PNNXIR";
}
const char* replace_pattern_graph() const
{
return R"PNNXIR(7767517
4 3
pnnx.Input input 0 1 input
F.sigmoid sigmoid 1 1 input sigmoid
UnaryOp log 1 1 sigmoid out 0=8
pnnx.Output output 1 0 out
)PNNXIR";
}
const char* type_str() const
{
return "F_logsigmoid";
}
const char* name_str() const
{
return "f_logsigmoid";
}
void write(const std::map<std::string, Operator*>& ops, const std::map<std::string, Parameter>& captured_params, const std::map<std::string, Attribute>& captured_attrs) const
{
GraphRewriterPass::write(ops, captured_params, captured_attrs);
}
};
REGISTER_GLOBAL_PNNX_NCNN_GRAPH_REWRITER_PASS(F_logsigmoid, 19)
} // namespace ncnn
} // namespace pnnx
(3 file diffs collapsed.)
......@@ -271,6 +271,8 @@ pnnx_add_test(torch_floor)
pnnx_add_test(torch_imag)
pnnx_add_test(torch_log)
pnnx_add_test(torch_log10)
pnnx_add_test(torch_maximum)
pnnx_add_test(torch_minimum)
pnnx_add_test(torch_neg)
pnnx_add_test(torch_pow)
pnnx_add_test(torch_real)
......
(7 file diffs collapsed.)
......@@ -27,9 +27,12 @@ class Model(nn.Module):
self.w3 = nn.Parameter(torch.rand(12, 15))
self.w4 = nn.Parameter(torch.rand(12, 15))
self.w5 = nn.Parameter(torch.rand(12, 15))
self.c0 = nn.Parameter(torch.ones(1))
self.c1 = nn.Parameter(torch.ones(3) + 0.2)
def forward(self, x):
x0 = x * 10
c10, c11, _ = torch.unbind(self.c1)
x0 = x * 10 + self.c0 - c11
x = x + self.w0 + x0
x = x - self.w1 + x0.float()
x = x * self.w2 + x0
......
(4 file diffs collapsed.)