README.md

    Jina banner

    An easier way to build neural search in the cloud

    Jina is a deep learning-powered search framework for building cross-/multi-modal search systems (e.g. text, images, video, audio) in the cloud.

    Time Saver - The design pattern of neural search systems, from zero to a production-ready system in minutes.

    🌌 Universal Search - Large-scale indexing and querying of unstructured data: video, image, long/short text, music, source code, etc.

    🧠 First-Class AI Models - First-class support for state-of-the-art AI models.

    Cloud Ready - Decentralized architecture with cloud-native features out-of-the-box: containerization, microservice, scaling, sharding, async IO, REST, gRPC, WebSocket.

    🧩 Plug & Play - Easily usable and extendable with a Pythonic interface.

    Made with Love - Lean dependencies (only 6!) and an uncompromising focus on quality, maintained by a passionate, full-time, venture-backed team.


    Docs • Hello World • Quick Start • Learn • Examples • Contribute • Jobs • Website • Slack

    Installation

    📦 x86/64, arm/v6, v7, v8 (Apple M1)

                     On Linux/macOS & Python 3.7/3.8/3.9   Docker Users
    Standard         pip install -U jina                   docker run jinaai/jina:latest
    Daemon           pip install -U "jina[daemon]"         docker run --network=host jinaai/jina:latest-daemon
    With Extras      pip install -U "jina[devel]"          docker run jinaai/jina:latest-devel
    Dev/Pre-Release  pip install --pre jina                docker run jinaai/jina:master

    Version identifiers are explained here. To install Jina with extra dependencies please refer to the docs. Jina can run on Windows Subsystem for Linux. We welcome the community to help us with native Windows support.

    Jina "Hello, World!" 👋🌍

    Just starting out? Try Jina's "Hello, World" - a simple image neural search demo for Fashion-MNIST. No extra dependencies needed, simply run:

    jina hello-world

    ...or even easier for Docker users, no install required:

    docker run -v "$(pwd)/j:/j" jinaai/jina hello-world --workdir /j && open j/hello-world.html  
    # replace "open" with "xdg-open" on Linux

    This downloads the Fashion-MNIST training and test datasets and tells Jina to index 60,000 images from the training set. It then randomly samples images from the test set as queries and asks Jina to retrieve relevant results. The whole process takes about 1 minute; when it finishes, it opens a webpage showing the results:

    Jina banner

    Intrigued? Play with different options:

    jina hello-world --help

    Get Started

    🐣 Create • Visualize • Feed Data • Fetch Result • Construct Document • Add Logic • Inter & Intra Parallelism • Decentralize • Asynchronous
    🚀 Customize Encoder • Test Encoder • Parallelism & Batching • Add Data Indexer • Compose Flow from YAML • Search • Evaluation • REST Interface

    Create

    Jina provides a high-level Flow API to simplify building search/index workflows. To create a new Flow:

    from jina import Flow
    f = Flow().add()

    This creates a simple Flow with one Pod. You can chain multiple .add()s in a single Flow.

    Visualize

    To visualize the Flow, simply chain it with .plot('my-flow.svg'). If you are using a Jupyter notebook, the Flow object is displayed inline automatically, without calling plot.

    Gateway is the entrypoint of the Flow.

    Feed Data

    Let's create some random data and index it:

    import numpy
    from jina import Document, Flow
    
    with Flow().add() as f:
        f.index((Document() for _ in range(10)))  # index raw Jina Documents
        f.index_ndarray(numpy.random.random([4,2]), on_done=print)  # index ndarray data, document sliced on first dimension
        f.index_lines(['hello world!', 'goodbye world!'])  # index textual data, each element is a document
        f.index_files(['/tmp/*.mp4', '/tmp/*.pdf'])  # index files and wildcard globs, each file is a document

    To use a Flow, open it using the with context manager, like you would a file in Python. You can call index and search with nearly all types of data. The whole data stream is asynchronous and efficient.

    Fetch Result

    Once a request is done, callback functions are fired. The Flow implements a Promise-like interface: you can add the callback functions on_done, on_error, and on_always to hook into different events. In the example below, the Flow passes the message through and prints the result on success. If something goes wrong, it beeps. In either case, the result is written to output.txt.

    def beep(*args):
        # make a beep sound
        import os
        os.system('echo -n "\a";')
    
    with Flow().add() as f, open('output.txt', 'w') as fp:
        f.index(numpy.random.random([4,5,2]),
                on_done=print, on_error=beep, on_always=lambda x: fp.write(x.to_json()))
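
    The callback semantics described above can be pictured with a small, library-free sketch. It is not Jina's implementation: run_with_callbacks and handler are hypothetical names, and the sketch only assumes that on_done fires on success, on_error on failure, and on_always in both cases.

```python
from typing import Any, Callable, Optional

Callback = Optional[Callable[[Any], None]]


def run_with_callbacks(handler: Callable[[Any], Any], request: Any,
                       on_done: Callback = None,
                       on_error: Callback = None,
                       on_always: Callback = None) -> None:
    """Hypothetical helper: run `handler` and fire Promise-like callbacks."""
    try:
        response = handler(request)
    except Exception as exc:  # failure path
        response = exc
        if on_error:
            on_error(exc)
    else:  # success path
        if on_done:
            on_done(response)
    if on_always:  # fires regardless of the outcome
        on_always(response)


events = []
run_with_callbacks(lambda r: r.upper(), 'hello',
                   on_done=lambda r: events.append(('done', r)),
                   on_error=lambda e: events.append(('error', type(e).__name__)),
                   on_always=lambda r: events.append(('always', r)))
run_with_callbacks(lambda r: r.upper(), 42,  # int has no .upper(), so this fails
                   on_done=lambda r: events.append(('done', r)),
                   on_error=lambda e: events.append(('error', type(e).__name__)),
                   on_always=lambda r: events.append(('always', None)))
```

    After both calls, events holds one done/always pair for the successful request and one error/always pair for the failing one.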

    Construct Document

    Document is Jina's primitive data type. It can contain text, an image, an array, an embedding, or a URI, accompanied by rich meta information. It can recurse both vertically and horizontally, giving nested documents (chunks) and matched documents (matches). To construct a Document:

    import numpy
    from jina import Document
    
    doc1 = Document(content=text_from_file, mime_type='text/x-python')  # a text document contains python code
    doc2 = Document(content=numpy.random.random([10, 10]))  # a ndarray document
    doc1.chunks.append(doc2)  # doc2 is now a sub-document of doc1
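
    To make the vertical and horizontal recursion concrete, here is a minimal sketch using a plain dataclass. SketchDocument and depth are hypothetical names invented for illustration; the sketch only mirrors the idea of chunks (nesting) and matches (same-level neighbours) and is not Jina's Document class.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a Document; not Jina's class."""
    content: str
    chunks: List['SketchDocument'] = field(default_factory=list)   # vertical: sub-documents
    matches: List['SketchDocument'] = field(default_factory=list)  # horizontal: retrieved neighbours


def depth(doc: SketchDocument) -> int:
    """Number of nesting levels reachable through `chunks`."""
    return 1 + max((depth(c) for c in doc.chunks), default=0)


article = SketchDocument('full article')
paragraph = SketchDocument('first paragraph')
sentence = SketchDocument('first sentence')
paragraph.chunks.append(sentence)   # sentence is a chunk of paragraph
article.chunks.append(paragraph)    # paragraph is a chunk of article
article.matches.append(SketchDocument('a similar article'))  # same level, not nested
```

    Here depth(article) counts three levels of chunks, while the match stays a separate document at the top level.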

    MultimodalDocument

    A MultimodalDocument is a document composed of multiple Documents from different modalities (e.g. text, image, audio).

    Jina provides multiple ways to build a multimodal Document. For example, one can provide the modality names and the content in a dict:

    import PIL.Image
    from jina import MultimodalDocument
    document = MultimodalDocument(modality_content_map={
        'title': 'my holiday picture',
        'description': 'the family having fun on the beach',
        'image': PIL.Image.open('path/to/image.jpg')
    })

    One can also compose a MultimodalDocument from multiple Document directly:

    import PIL.Image
    from jina import Document, MultimodalDocument
    
    doc_title = Document(content='my holiday picture', modality='title')
    doc_desc = Document(content='the family having fun on the beach', modality='description')
    doc_img = Document(content=PIL.Image.open('path/to/image.jpg'), modality='image')
    doc_img.tags['date'] = '10/08/2019'
    
    document = MultimodalDocument(chunks=[doc_title, doc_desc, doc_img])

    Fusion Embeddings from Different Modalities

    To extract fusion embeddings from different modalities, Jina provides the BaseMultiModalEncoder abstract class, which has a unique encode interface:

    def encode(self, *data: 'numpy.ndarray', **kwargs) -> 'numpy.ndarray':
        ...

    MultiModalDriver passes the data of a MultimodalDocument to the encoder in the expected order. In the example below, the image embedding is passed to the encoder as the first argument, and the text as the second.

    !MyMultimodalEncoder
    with:
      positional_modality: ['image', 'text']
    requests:
      on:
        [IndexRequest, SearchRequest]:
          - !MultiModalDriver {}
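
    The role of positional_modality can be sketched without Jina. The helper order_by_modality and the concatenating encode below are hypothetical illustrations, not the library's driver or encoder; they only assume that each modality arrives as a vector and that the encoder receives them positionally.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def order_by_modality(chunks: Dict[str, Vector],
                      positional_modality: Sequence[str]) -> List[Vector]:
    """Return the chunk data in the order the encoder expects."""
    return [chunks[name] for name in positional_modality]


def encode(*data: Vector) -> Vector:
    """Toy fusion: concatenate the per-modality vectors."""
    fused: Vector = []
    for vector in data:
        fused.extend(vector)
    return fused


chunks = {'text': [0.1, 0.2], 'image': [0.9, 0.8, 0.7]}  # arrival order is arbitrary
ordered = order_by_modality(chunks, positional_modality=['image', 'text'])
embedding = encode(*ordered)  # image first, text second
```

    Whatever order the chunks arrive in, the image vector always lands in the first positional slot of encode.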

    Interested readers can refer to jina-ai/example: how to build a multimodal search engine for image retrieval using TIRG (Composing Text and Image for Image Retrieval) to see MultiModalDriver and BaseMultiModalEncoder used in practice.

    Add Logic

    To add logic to the Flow, use the uses parameter to attach a Pod with an Executor. uses accepts multiple value types including class name, Docker image, (inline) YAML or built-in shortcut.

    f = (Flow().add(uses='MyBertEncoder')  # class name of a Jina Executor
               .add(uses='docker://jinahub/pod.encoder.dummy_mwu_encoder:0.0.6-0.9.3')  # the image name
               .add(uses='myencoder.yml')  # YAML serialization of a Jina Executor
               .add(uses='!WaveletTransformer | {freq: 20}')  # inline YAML config
               .add(uses='_pass')  # built-in shortcut executor
               .add(uses={'__cls': 'MyBertEncoder', 'with': {'param': 1.23}}))  # dict config object with __cls keyword

    The power of Jina lies in its decentralized architecture: each add creates a new Pod, and these Pods can be run as a local thread/process, a remote process, inside a Docker container, or even inside a remote Docker container.

    Inter & Intra Parallelism

    Chaining .add()s creates a sequential Flow. For parallelism, use the needs parameter:

    f = (Flow().add(name='p1', needs='gateway')
               .add(name='p2', needs='gateway')
               .add(name='p3', needs='gateway')
               .needs(['p1','p2', 'p3'], name='r1').plot())

    p1, p2, p3 now subscribe to Gateway and conduct their work in parallel. The last .needs() blocks all Pods until they finish their work. Note: parallelism can also be performed inside a Pod using parallel:

    f = (Flow().add(name='p1', needs='gateway')
               .add(name='p2', needs='gateway')
               .add(name='p3', parallel=3)
               .needs(['p1','p3'], name='r1').plot())
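
    The fan-out and join pattern behind needs can be illustrated with the standard library alone. This is a conceptual sketch, not how Jina schedules Pods: make_pod is a hypothetical factory, and threads stand in for Pods that all receive the same message from the gateway.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_pod(name: str, delay: float):
    """Hypothetical Pod: sleeps to simulate work, then tags the message."""
    def pod(message: str) -> str:
        time.sleep(delay)
        return f'{message}->{name}'
    return pod


pods = [make_pod('p1', 0.2), make_pod('p2', 0.2), make_pod('p3', 0.2)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=len(pods)) as pool:
    # fan out: every pod receives the same message from the "gateway"
    futures = [pool.submit(pod, 'gateway') for pod in pods]
    # join: like needs([...]), wait for all branches before continuing
    results = [f.result() for f in futures]
elapsed = time.perf_counter() - start
```

    Because the three branches overlap, the elapsed time stays close to one delay rather than the sum of all three.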

    Decentralized Flow

    A Flow does not have to be local-only: any Pod can be placed on a remote machine. In the example below, the host keyword puts gpu_pod on a remote machine for parallelization, whereas the other Pods stay local. Extra file dependencies that need to be uploaded are specified via the upload_files keyword.

    On the remote machine (123.456.78.9):

    # have docker installed
    docker run --name=jinad --network=host -v /var/run/docker.sock:/var/run/docker.sock jinaai/jina:latest-daemon --port-expose 8000
    # to stop it
    docker rm -f jinad

    On the local machine:

    import numpy as np
    from jina import Flow
    
    f = (Flow()
         .add()
         .add(name='gpu_pod',
              uses='mwu_encoder.yml',
              host='123.456.78.9:8000',
              parallel=2,
              upload_files=['mwu_encoder.py'])
         .add())
    
    with f:
        f.index_ndarray(np.random.random([10, 100]), on_done=print)

    We provide a demo server at cloud.jina.ai:8000. Give the following snippet a try!

    from jina import Flow
    
    with Flow().add().add(host='cloud.jina.ai:8000') as f:
        f.index(['hello', 'world'])

    Asynchronous Flow

    Although synchronous from the outside, Jina runs asynchronously underneath: it manages the event loop(s) for scheduling the jobs. When you need more control over the event loop, use AsyncFlow. In the example below, Jina is part of an integration where another heavy-lifting job runs concurrently:

    import asyncio
    import numpy
    from jina import AsyncFlow
    
    async def run_async_flow_5s():  # WaitDriver pause 5s makes total roundtrip ~5s
        with AsyncFlow().add(uses='- !WaitDriver {}') as f:
            await f.index_ndarray(numpy.random.random([5, 4]), on_done=print)
    
    async def heavylifting():  # total roundtrip takes ~5s
        print('heavylifting other io-bound jobs, e.g. download, upload, file io')
        await asyncio.sleep(5)
        print('heavylifting done after 5s')
    
    async def concurrent_main():  # about 5s; but some dispatch cost, can't be just 5s, usually at <7s
        await asyncio.gather(run_async_flow_5s(), heavylifting())
    
    if __name__ == '__main__':
        asyncio.run(concurrent_main())
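
    The same concurrency pattern can be tried without Jina installed. In this sketch fake_flow_request and other_io_job are hypothetical stand-ins that merely sleep, with shorter delays than the 5 seconds used above; it shows only that two awaited jobs sharing one event loop overlap in time.

```python
import asyncio
import time


async def fake_flow_request(seconds: float) -> str:
    """Stand-in for an awaited Flow request (no Jina involved)."""
    await asyncio.sleep(seconds)
    return 'flow done'


async def other_io_job(seconds: float) -> str:
    """Stand-in for an unrelated IO-bound job."""
    await asyncio.sleep(seconds)
    return 'io done'


async def main() -> list:
    # both coroutines share one event loop and overlap while sleeping
    return await asyncio.gather(fake_flow_request(0.2), other_io_job(0.2))


start = time.perf_counter()
results = asyncio.run(main())
elapsed = time.perf_counter() - start
```

    The total time is roughly that of the longer job plus a small dispatch cost, not the sum of both.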

    AsyncFlow is very useful when using Jina inside Jupyter Notebook. As Jupyter/ipython already manages an eventloop and thanks to autoawait, the following code can run out-of-the-box in Jupyter:

    import numpy
    from jina import AsyncFlow
    
    with AsyncFlow().add() as f:
        await f.index_ndarray(numpy.random.random([5, 4]), on_done=print)

    That's all you need to know for understanding the magic behind hello-world. Now let's dive into it!

    Breakdown of hello-world

    Customize Encoder

    Let's first build a naive image encoder that embeds images into vectors using an orthogonal projection. To do this, we simply inherit from BaseImageEncoder: a base class from the jina.executors.encoders module. We then override its __init__() and encode() methods.

    import numpy as np
    from jina.executors.encoders import BaseImageEncoder
    
    class MyEncoder(BaseImageEncoder):
    
        def __init__(self, *args, **kwargs):
            super().__init__(*args, **kwargs)
            np.random.seed(1337)
            H = np.random.rand(784, 64)
            u, s, vh = np.linalg.svd(H, full_matrices=False)
            self.oth_mat = u @ vh
    
        def encode(self, data: 'np.ndarray', *args, **kwargs):
            return (data.reshape([-1, 784]) / 255) @ self.oth_mat
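
    As a quick sanity check of the projection, the following standalone sketch rebuilds the matrix with NumPy only and verifies that its columns are orthonormal. It assumes nothing about Jina; the variable names and the fake image batch are illustrative, and a local RandomState is used instead of the global seed.

```python
import numpy as np

rng = np.random.RandomState(1337)            # local generator, same seed value as above
H = rng.rand(784, 64)
u, s, vh = np.linalg.svd(H, full_matrices=False)
oth_mat = u @ vh                             # shape (784, 64)

# the 64 columns are orthonormal: oth_mat.T @ oth_mat is the identity
gram = oth_mat.T @ oth_mat
is_orthonormal = np.allclose(gram, np.eye(64))

# encode a batch of 100 fake 28x28 images, scaled as in encode()
images = rng.randint(0, 256, size=(100, 28, 28))
embeddings = (images.reshape([-1, 784]) / 255) @ oth_mat
```

    Each flattened 784-pixel image is mapped to a 64-dimensional vector, so the batch of 100 images yields a 100x64 array.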

    Jina provides a family of Executor classes, which summarize frequently-used algorithmic components in neural search. This family consists of encoders, indexers, crafters, evaluators, and classifiers, each with a well-designed interface. You can find the list of all 107 built-in executors here. If they don't meet your needs, inheriting from one of them is the easiest way to bootstrap your own Executor. Simply use our Jina Hub CLI:

    pip install "jina[hub]" && jina hub new

    Test Encoder in Flow

    Let's test our encoder in the Flow with some synthetic data:

    import numpy
    from jina import Flow
    from jina.types.ndarray.generic import NdArray
    
    def validate(req):
        assert len(req.docs) == 100
        assert NdArray(req.docs[0].embedding).value.shape == (64,)
    
    f = Flow().add(uses='MyEncoder')
    
    with f:
        f.index_ndarray(numpy.random.random([100, 28, 28]), on_done=validate)

    All good! Our validate function confirms that all one hundred 28x28 synthetic images have been embedded into one hundred 64-dimensional vectors (a 100x64 matrix).

    Parallelism & Batching

    By setting a larger input, you can play with batch_size and parallel:

    f = Flow().add(uses='MyEncoder', parallel=10)
    
    with f:
        f.index_ndarray(numpy.random.random([60000, 28, 28]), batch_size=1024)
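
    To get a feel for what batch_size=1024 means for 60,000 inputs, here is a plain-Python sketch of slicing an input into consecutive batches. The batches helper is hypothetical and says nothing about how Jina distributes the batches across parallel workers.

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `batch_size` items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


doc_ids = range(60000)                       # stand-in for 60,000 images
chunks: List[Sequence] = list(batches(doc_ids, batch_size=1024))

num_requests = len(chunks)                   # 59 (58 full batches + 1 partial)
last_batch_size = len(chunks[-1])            # 608
```

    Larger batches mean fewer round trips but more memory per request, which is the trade-off to explore together with parallel.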

    Add Data Indexer

    Now we need to add an indexer to store all the embeddings and the images for later retrieval. Jina provides a simple numpy-powered vector indexer, NumpyIndexer, and a key-value indexer, BinaryPbIndexer. We can combine them in a single YAML file:

    !CompoundIndexer
    components:
      - !NumpyIndexer
        with:
          index_filename: vec.gz
      - !BinaryPbIndexer
        with:
          index_filename: chunk.gz
    metas:
      workspace: ./

    • ! tags a structure with a class name
    • with defines arguments for initializing this class object.

    Essentially, the above YAML config is equivalent to the following Python code:

    from jina.executors.indexers.vector import NumpyIndexer
    from jina.executors.indexers.keyvalue import BinaryPbIndexer
    from jina.executors.indexers import CompoundIndexer
    
    a = NumpyIndexer(index_filename='vec.gz')
    b = BinaryPbIndexer(index_filename='chunk.gz')
    c = CompoundIndexer()
    c.components = lambda: [a, b]
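
    How a vector indexer and a key-value indexer cooperate can be sketched in a few lines of plain Python. ToyVectorIndexer and ToyKeyValueIndexer are hypothetical classes written for illustration, using brute-force Euclidean distance; they are not NumpyIndexer or BinaryPbIndexer.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]


class ToyVectorIndexer:
    """Hypothetical brute-force vector index: id -> embedding."""
    def __init__(self) -> None:
        self.vectors: Dict[int, Vector] = {}

    def add(self, doc_id: int, vector: Vector) -> None:
        self.vectors[doc_id] = vector

    def query(self, vector: Vector, top_k: int) -> List[Tuple[int, float]]:
        scored = [(doc_id, math.sqrt(sum((a - b) ** 2 for a, b in zip(vector, v))))
                  for doc_id, v in self.vectors.items()]
        return sorted(scored, key=lambda pair: pair[1])[:top_k]


class ToyKeyValueIndexer:
    """Hypothetical key-value store: id -> full document."""
    def __init__(self) -> None:
        self.docs: Dict[int, str] = {}

    def add(self, doc_id: int, doc: str) -> None:
        self.docs[doc_id] = doc

    def query(self, doc_id: int) -> str:
        return self.docs[doc_id]


vec_index, kv_index = ToyVectorIndexer(), ToyKeyValueIndexer()
for doc_id, (vector, doc) in enumerate([([0.0, 0.0], 'shirt'),
                                        ([1.0, 1.0], 'sneaker'),
                                        ([0.1, 0.0], 't-shirt')]):
    vec_index.add(doc_id, vector)
    kv_index.add(doc_id, doc)

hits = vec_index.query([0.0, 0.1], top_k=2)          # nearest ids first
results = [kv_index.query(doc_id) for doc_id, _ in hits]
```

    The vector index answers "which ids are closest?", and the key-value index turns those ids back into the stored documents.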

    Compose Flow from YAML

    Now let's add our indexer YAML file to the Flow with .add(uses=). Let's also add two shards to the indexer to improve its scalability:

    f = Flow().add(uses='MyEncoder', parallel=2).add(uses='myindexer.yml', shards=2).plot()
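
    The idea behind sharding an indexer can be sketched as follows: documents are spread over shards at index time, and at query time every shard returns its partial hits, which are merged into one global top-k. The round-robin route and the merge_top_k helper are assumptions made for illustration; this README does not describe Jina's actual routing or merging.

```python
import heapq
from typing import List, Tuple

Hit = Tuple[float, str]  # (distance, doc id); smaller distance is better


def route(doc_index: int, num_shards: int) -> int:
    """Hypothetical round-robin routing of the n-th document to a shard."""
    return doc_index % num_shards


def merge_top_k(per_shard_hits: List[List[Hit]], top_k: int) -> List[Hit]:
    """Merge the partial results of all shards into one global top-k."""
    return heapq.nsmallest(top_k, (hit for hits in per_shard_hits for hit in hits))


shards: List[List[Hit]] = [[], []]
all_hits = [(0.30, 'a'), (0.10, 'b'), (0.50, 'c'), (0.20, 'd')]
for i, hit in enumerate(all_hits):
    shards[route(i, num_shards=2)].append(hit)   # index time: spread over shards

top2 = merge_top_k(shards, top_k=2)              # query time: ask all, then merge
```

    In this toy run both best hits happen to live in the second shard, which is why every shard must be queried before merging.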

    When you have many arguments, constructing a Flow in Python can get cumbersome. In that case, you can simply move all arguments into one flow.yml:

    !Flow
    version: '1.0'
    pods:
      - name: encode
        uses: MyEncoder
        parallel: 2
      - name: index
        uses: myindexer.yml
        shards: 2

    And then load it in Python:

    f = Flow.load_config('flow.yml')

    Search

    Querying a Flow is similar to what we did with indexing. Simply load the query Flow and switch from f.index to f.search. Say you want to retrieve the top 50 documents that are similar to your query and then plot them in HTML:

    f = Flow.load_config('flows/query.yml')
    with f:
        f.search_ndarray(numpy.random.random([10, 28, 28]), shuffle=True, on_done=plot_in_html, top_k=50)

    Evaluation

    To compute precision and recall on the retrieved results, add _eval_pr, a built-in evaluator, to the Flow:

    f = (Flow().add(...)
               .add(uses='_eval_pr'))

    You can construct an iterator of query and groundtruth pairs and feed it to the Flow f:

    from jina import Document
    
    def query_generator():
        for _ in range(10):
            q = Document()
            # now construct the expected matches as groundtruth
            gt = Document(q, copy=True)  # make sure 'gt' is identical to 'q'
            gt.matches.append(...)
            yield q, gt
    
    f.search(query_generator(), ...)
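
    For reference, precision and recall for a single query can be computed as in the sketch below. It works on plain document ids and is independent of Jina; precision_recall is a hypothetical helper, not the _eval_pr evaluator.

```python
from typing import Sequence, Tuple


def precision_recall(matches: Sequence[str], groundtruth: Sequence[str],
                     k: int) -> Tuple[float, float]:
    """Precision@k and recall@k for one query, by document id."""
    top = list(matches)[:k]
    relevant = set(groundtruth)
    hits = sum(1 for doc_id in top if doc_id in relevant)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = ['d1', 'd7', 'd3', 'd9']       # ranked matches returned for a query
expected = ['d1', 'd3', 'd5']              # groundtruth matches for the same query
precision, recall = precision_recall(retrieved, expected, k=4)
```

    Two of the four retrieved documents are relevant (precision 0.5), and two of the three expected documents were found (recall 2/3).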

    REST Interface

    In practice, the query Flow and the client (i.e. data sender) are often physically separated. Moreover, the client may prefer to use a REST API rather than gRPC when querying. You can set port_expose to a public port and turn on REST support with restful=True:

    f = Flow(port_expose=45678, restful=True)
    
    with f:
        f.block()
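
    A client could then talk to the Flow over plain HTTP. The sketch below only builds the request with the standard library and does not send it. The host, the /api/search path, and the payload fields are assumptions that this README does not state, so check the REST documentation for the exact endpoint and schema.

```python
import json
from urllib import request

# Assumptions (not stated in this README): the REST gateway listens on
# localhost and exposes a JSON search endpoint at /api/search.
url = 'http://localhost:45678/api/search'
payload = {'top_k': 10, 'data': ['text:hello world']}

req = request.Request(
    url,
    data=json.dumps(payload).encode('utf-8'),
    headers={'Content-Type': 'application/json'},
    method='POST',
)

# Sending requires a running Flow started with restful=True:
# with request.urlopen(req) as resp:
#     print(json.load(resp))
```

    Keeping the client free of Jina imports is the point of the REST interface: any language with an HTTP library can issue the same request.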

    That is the essence behind jina hello-world. It is merely a taste of what Jina can do. We’re really excited to see what you do with Jina! You can easily create a Jina project from templates with one terminal command:

    pip install "jina[hub]" && jina hub new --type app

    This creates a Python entrypoint, YAML configs and a Dockerfile. You can start from there.

    Learn

    Jina 101 Concept Illustration Book, Copyright by Jina AI Limited   

    Jina 101: First Things to Learn About Jina

      English • 日本語 • Français • Português • Deutsch • Русский язык • 中文 • عربية

    Examples (View all)

    Example code to build your own projects

    📄

    My First Jina App

    Brand new to neural search? Not for long! Use cookiecutter to search through Star Trek scripts using Jina

    📄

    Build an NLP Semantic Search System with Transformers

    Upgrade from plain search to sentence search and practice your Flows and Pods by searching South Park scripts

    📄

    Search Lyrics with Transformers and PyTorch

    Get a better understanding of chunks by searching a lyrics database. Now with shiny front-end!

    🖼

    Google's Big Transfer Model in (Poké-)Production

    Use SOTA visual representation for searching Pokémon!

    🖼

    Object detection with fasterrcnn and MobileNetV2

    Detect, index and query similar objects

    🎧

    Search YouTube audio data with Vggish

    A demo of neural search for audio data based on the Vggish model.

    🎞

    Search Tumblr GIFs with KerasEncoder

    Use prefetching and sharding to improve the performance of your index and query flow when searching animated GIFs.

    Please check our examples repo for advanced and community-submitted examples.

    Want to read more? Check our Founder Han Xiao's blog and our official blog.

    Documentation

    Apart from the learning resources provided above, we highly recommend going through our documentation to master Jina.

    Our docs are built on every push, merge, and release of Jina's master branch. Documentation for older versions is archived here.

    Are you a "Doc"-star? Join us! We welcome all kinds of improvements on the documentation.

    Contributing

    We welcome all kinds of contributions from the open-source community, individuals and partners. We owe our success to your active involvement.

    Contributors

    All Contributors

    Community

    • Code of conduct - play nicely with the Jina community
    • Slack workspace - join #general on our Slack to meet the team and ask questions
    • YouTube channel - subscribe to the latest video tutorials, release demos, webinars and presentations.
    • LinkedIn - get to know Jina AI as a company and find job opportunities
    • Twitter - follow and interact with us using hashtag #JinaSearch
    • Company - learn more about our company and how we are fully committed to open-source.

    Open Governance

    GitHub milestones lay out the path to Jina's future improvements.

    As part of our open governance model, we host Jina's Engineering All Hands in public. This Zoom meeting takes place on the second Tuesday of each month, 14:00-15:30 (CET). Everyone can join via the following calendar invite.

    The meeting will also be live-streamed and later published to our YouTube channel.

    Join Us

    Jina is an open-source project. We are hiring full-stack developers, evangelists, and PMs to build the next neural search ecosystem in open source.

    License

    Copyright (c) 2020 Jina AI Limited. All rights reserved.

    Jina is licensed under the Apache License, Version 2.0. See LICENSE for the full license text.

    Project description

    Cloud-native neural search framework for 𝙖𝙣𝙮 kind of data

    🚀 GitHub mirror repository 🚀

    Source project

    https://github.com/jina-ai/jina
