<!-- groonga-command -->
<!-- database: tokenizers_language_model_knn -->

# `TokenLanguageModelKNN`

```{versionadded} 15.1.8

```

```{note}
This is an experimental feature and is not yet stable.
```

## Summary

`TokenLanguageModelKNN` is a tokenizer that supports semantic search.

Semantic search uses the k-Nearest Neighbors (k-NN) algorithm.

To enable this tokenizer, register the `language_model/knn` plugin with the following command:

```shell
plugin_register language_model/knn
```

## Syntax

`TokenLanguageModelKNN` requires two parameters:

```
TokenLanguageModelKNN("model", "hf:///path/to", "code_column", "column_name")
```

`TokenLanguageModelKNN` also has optional parameters:

```
TokenLanguageModelKNN("model", "hf:///path/to", \
                      "code_column", "column_name", \
                      "n_clusters", N_CLUSTERS)

TokenLanguageModelKNN("model", "hf:///path/to", \
                      "code_column", "column_name", \
                      "passage_prefix", "passage: ", \
                      "query_prefix", "query: ")

TokenLanguageModelKNN("model", "hf:///path/to", \
                      "code_column", "column_name", \
                      "centroid_column", "centroid_column_name")
```

```{versionadded} 15.1.9
{ref}`tokenizer-language-model-knn-passage-prefix` and {ref}`tokenizer-language-model-knn-query-prefix` are added.
```

```{versionadded} 15.2.1
{ref}`tokenizer-language-model-knn-centroid-column` is added.
```

## Usage

```{note}
This tokenizer can't be run with the {doc}`../commands/tokenize` command.
```

This usage example shows how to set `TokenLanguageModelKNN` as `default_tokenizer`.

First, register the `language_model/knn` plugin:

<!-- groonga-command -->

```{include} ../../example/reference/tokenizers/language_model_knn/usage_register.md
plugin_register language_model/knn
```

Here is a schema definition and sample data.

Sample schema:

<!-- groonga-command -->

```{include} ../../example/reference/tokenizers/language_model_knn/usage_setup_schema.md
table_create --name Memos --flags TABLE_NO_KEY
column_create \
  --table Memos \
  --name content \
  --flags COLUMN_SCALAR \
  --type ShortText
```

Sample data:

<!-- groonga-command -->

```{include} ../../example/reference/tokenizers/language_model_knn/usage_setup_data.md
load --table Memos
[
{"content": "I am a boy."},
{"content": "This is an apple."},
{"content": "Groonga is a full text search engine."}
]
```

Each record needs a column that stores its embedding information. Create that column as follows:

<!-- groonga-command -->

```{include} ../../example/reference/tokenizers/language_model_knn/column_create.md
column_create Memos embedding_code COLUMN_SCALAR ShortBinary
```

Create an index for semantic search.

Specify `TokenLanguageModelKNN` as the tokenizer.

<!-- groonga-command -->

```{include} ../../example/reference/tokenizers/language_model_knn/index_column_create.md
table_create Centroids TABLE_HASH_KEY ShortBinary \
  --default_tokenizer \
    'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                           "code_column", "embedding_code")'

column_create Centroids data_content COLUMN_INDEX Memos content
```

Select the `Memos` table to confirm that the embeddings have been generated.
The generated code is saved as binary data in the `embedding_code` column.

<!-- groonga-command -->

```{include} ../../example/reference/tokenizers/language_model_knn/select.md
select Memos
```

You don't need to read or write `embedding_code` yourself. Groonga uses it internally for semantic search.
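
With the index in place, a semantic search can be sketched as follows. This assumes that {doc}`../functions/language_model_knn` takes the indexed column and the query text as its arguments; check that page for the exact signature and options. The query text `"male child"` is only an illustration.

```
select Memos \
  --filter 'language_model_knn(content, "male child")' \
  --output_columns content
```

The intent is that records whose `content` is semantically close to the query can match even when they share no words with it.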

## Parameters

### Required parameters

(tokenizer-language-model-knn-model)=

#### `model`

Specify the language model to use.
You can specify a Hugging Face URI for `model`.

When the index is first created, Groonga automatically downloads the model and places it in the Groonga database directory.
After that, Groonga uses the local copy of the model.

Example of URI: `hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF` for `https://huggingface.co/groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF`.

See also {doc}`../language_model` for details on language models.

(tokenizer-language-model-knn-code-column)=

#### `code_column`

Specify the name of the column that stores the embedding.

Create this column in the table that holds the text to be searched, and pass its name here.

### Optional parameters

(tokenizer-language-model-knn-passage-prefix)=

#### `passage_prefix`

```{versionadded} 15.1.9

```

Some models, such as multilingual-e5, require prefixes for search-target text and query text.

`passage_prefix` specifies the prefix for search-target text.

For example, the following sets the `passage: ` prefix for search-target text and the `query: ` prefix for query text:

```
TokenLanguageModelKNN("model", "hf:///groonga/multilingual-e5-base-Q4_K_M-GGUF", \
                      "code_column", "embedding_code", \
                      "passage_prefix", "passage: ", \
                      "query_prefix", "query: ")
```

(tokenizer-language-model-knn-query-prefix)=

#### `query_prefix`

```{versionadded} 15.1.9

```

Some models, such as multilingual-e5, require prefixes for search-target text and query text.

`query_prefix` specifies the prefix for query text.

(tokenizer-language-model-knn-centroid-column)=

#### `centroid_column`

```{versionadded} 15.2.1

```

This option is for large embeddings with more than 1024 dimensions (more than 4096 bytes at 4 bytes per dimension).
Groonga table keys must be 4 KiB or smaller. Embeddings larger than 4 KiB cannot be stored as keys.

You can use this option to store large embeddings in a column instead of in the table key.

Execution example:

```
table_create LargeCentroids TABLE_HASH_KEY UInt32 \
  --default_tokenizer \
      'TokenLanguageModelKNN("model", "hf:///groonga/multilingual-e5-base-Q4_K_M-GGUF", \
                             "centroid_column", "centroid", \
                             "code_column", "embedding_code", \
                             "passage_prefix", "passage: ", \
                             "query_prefix", "query: ")'

column_create LargeCentroids centroid COLUMN_VECTOR Float32
column_create LargeCentroids data_content COLUMN_INDEX Memos content
```

Pass the name of the column that stores the embeddings as `centroid_column`, and create that column in the index table (`LargeCentroids` in this example).

(tokenizer-language-model-knn-n-clusters)=

#### `n_clusters`

Specify the number of clusters used for the index.
If omitted, Groonga sets an appropriate value automatically.

In most cases, you don't need to set this option explicitly.
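
If you do need to set it, pass the value as an extra key-value pair. The following sketch is an alternative to the `Centroids` definition in the usage example and assumes the `Memos` table and its `embedding_code` column already exist. The value `16` is arbitrary and not a recommendation.

```
table_create Centroids TABLE_HASH_KEY ShortBinary \
  --default_tokenizer \
    'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                           "code_column", "embedding_code", \
                           "n_clusters", 16)'
```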

#### `n_gpu_layers`

```{versionadded} 15.2.1

```

Specify the number of GPU layers to use for the language model.
If omitted, Groonga uses the GPU as much as possible.

In most cases, you don't need to set this option explicitly.

To disable GPU usage, set `n_gpu_layers` to 0.
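
For example, a CPU-only setup might look like the following sketch. It assumes that `n_gpu_layers` is passed as a key-value pair with a numeric value, in the same way as `n_clusters`, and it reuses the `embedding_code` column from the usage example as an alternative to the `Centroids` definition shown there.

```
table_create Centroids TABLE_HASH_KEY ShortBinary \
  --default_tokenizer \
    'TokenLanguageModelKNN("model", "hf:///groonga/all-MiniLM-L6-v2-Q4_K_M-GGUF", \
                           "code_column", "embedding_code", \
                           "n_gpu_layers", 0)'
```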

## See also

- {doc}`../language_model`
- {doc}`../commands/tokenize`
- {doc}`../functions/language_model_knn`
