Sparse Vectors

set serveroutput on;
declare
    v_sparse vector(*, int8, sparse);
    v_dense_text varchar2(100);
begin
    --construct a sparse vector from textual input
    v_sparse := to_vector('[16,[4,6],[7,9]]', *, int8, sparse);
    
    select vector_serialize(v_sparse returning clob format dense)
    into v_dense_text;
    dbms_output.put_line('Serialized to dense vector: ' || v_dense_text);
    
end;
/

Serialized to dense vector: [0,0,0,0,7,0,9,0,0,0,0,0,0,0,0,0]

Sparse vector storage format was introduced in Oracle 23ai release 6. The idea of a sparse vector is to reduce the storage requirements of vectors that have mostly zero valued dimensions. Models that produce these embeddings generally use a lexical approach (called Bag of Words), where dimensions represent vocabulary tokens from the encoded text or expanded words with similar meanings. The value of each dimension is the priority weight of that token for the overall encoded text. Because of this approach, most dimensions of a given sparse vector will have 0 values.

Sparse vectors are generated by lexical models in the SPLADE family (Sparse Lexical and Expansion models) or based on BM25. Look up these models to learn more about the construction and meaning of sparse vectors.

Reviewing the literature, there are actually variety of sparse vector formats. In Python they may be represented with PyTorch tensors, numpy arrays or dictionary structures. Oracle AI Database uses a standardized format for sparse vectors, each vector has an array of populated dimension indices and an array of corresponding values. The textual format can optionally include the total dimension count as well:

[ optional dimension count, [ populated dimension indices ], [ dimension values ] ]

In the above example, only 2 of 16 dimensions have values, so there are two entries in each array: [4, 6] indicates that the 5th and 7th dimensions are populated, [7, 9] are the values for these dimensions. Like the elements of a json array, vector dimensions are numbered starting from zero.

A dense vector would fill in all of the zeros, requiring more storage. Instead of the sparse notation [16, [4, 6], [7, 9]] the same vector in dense format would be [0,0,0,0,7,0,9,0,0,0,0,0,0,0,0,0]. While the storage savings for this 16 dimensional vector are obvious, they are much greater for sparse vectors with a large number of dimensions. The SPLADE family of sparse encoders typically generate sparse vectors with 35,200 dimensions that only populate a few hundred actual dimensions.

  1. Getting Started
  2. Tables with Sparse Vectors
  3. Inserting Sparse Vectors
  4. No Sparse Vector Indexes
  5. Constructing Sparse Vectors Using Textual Input
  6. Constructing Sparse Vectors From Other Vectors
  7. Constructing Sparse Vectors in PL/SQL
  8. Serializing Sparse Vectors
  9. Serialization in PL/SQL
  10. Other Vector Functions
  11. Sparse Vector Arithmetic
  12. Sparse Vector Aggregation
  13. Zero Based Vector Dimension Indices
  14. Conclusion

Getting Started

The source code for all articles here on practicalplsql.org is available in the practicalplsql public repository on github in the scripts/articles directory organized by concept and article. You can use create.user.practicalplsql.sql to create the practicalplsql schema. Sample data for articles can be found in the scripts/examples directory organized by example name.

Scripts for this post are located in sparse-vectors. This post requires a minimum version of Oracle 23ai release 9. All scripts have been tested in OCI Autonomous databases and the Oracle 23ai Free database VirtualBox appliance where this release update is available.

Tables with Sparse Vectors

We can use the sparse keyword after the dimension format to declare a table column as a sparse vector. The default vector storage format is dense, specifying nothing will create a dense vector as will using the dense keyword.

create.table.sparse_vector_test.sql

create table sparse_vector_test(
    v_sparse  vector(8, int8, sparse),
    v_dense   vector(8, int8, dense),
    v_default vector(8, int8)
)
/

describe sparse_vector_test;

Name      Null? Type                  
--------- ----- --------------------- 
V_SPARSE        VECTOR(8,INT8,SPARSE) 
V_DENSE         VECTOR(8,INT8,DENSE)  
V_DEFAULT       VECTOR(8,INT8,DENSE)  

Alter table ddl statements cannot be used to change a vector column between sparse and dense storage formats:

alter.table.sparse_vector_test.sql

alter table sparse_vector_test
modify v_dense vector(8, int8, sparse);

ORA-51859: Unsupported VECTOR column modification.

alter table sparse_vector_test
modify v_sparse vector(8, int8, dense);

ORA-51859: Unsupported VECTOR column modification.

The only way to change the column storage format would be to drop and recreate the column with the desired format. Actually, any modification of a vector column’s dimension count or dimension format will also generate the ORA-51859 error.

Inserting Sparse Vectors

When the storage format of a vector column is defined as sparse, we can directly insert text to the column if it is formatted correctly as sparse vector textual input. If the column definition includes the dimension count, the textual input can omit the dimension count.

We can also insert constructed vectors into the column. In this case it doesn’t matter if the vector was constructed as a sparse or dense vector, the constructed vector can be implicitly converted to the sparse storage format on insert.

inserting-sparse-vectors.sql

create table sparse_vectors(
    id integer generated always as identity primary key,
    embedding vector(8, int8, sparse),
    notes varchar2(100)
)
/

begin

    insert into sparse_vectors(
        embedding
        , notes
    ) values (
        '[8,[0,1,4],[1,2,5]]'
        , 'sparse text input with dimension count');
        
    insert into sparse_vectors(
        embedding
        , notes
    ) values (
        '[[0,1,4],[1,2,5]]'
        , 'sparse text input without dimension count');        
    
    insert into sparse_vectors(
        embedding
        , notes
    ) values (
        to_vector('[8,[0,1,4],[1,2,5]]', *, *, sparse)
        , 'constructed sparse vector');
    
    insert into sparse_vectors(
        embedding
        , notes
    ) values (
        to_vector('[1,2,0,0,5,0,0,0]', *, *, dense)
        , 'constructed dense vector');
    
    commit;

end;
/

select
    from_vector(embedding) as serialized
    , notes
from sparse_vectors
/

SERIALIZED                NOTES                                     
------------------------- ------------------------------------------
[8,[0,1,4],[1,2,5]]       sparse text input with dimension count    
[8,[0,1,4],[1,2,5]]       sparse text input without dimension count 
[8,[0,1,4],[1,2,5]]       constructed sparse vector                 
[8,[0,1,4],[1,2,5]]       constructed dense vector  

Dense textual input cannot be directly inserted to a sparse vector column. There are no supported implicit conversions between textual input formats.

insert into sparse_vectors(
    embedding
    , notes
) values (
    '[1,2,0,0,5,0,0,0]'
    , 'inserted as dense textual input');

SQL Error: ORA-51833: Textual input conversion between sparse and dense vector is not supported.

No Sparse Vector Indexes

Creation of vector indexes on sparse vector columns is explicitly not supported. Because sparse vectors don’t support the aggregate avg method, there would not be any way to calculate the centroids as the average centers of vector clusters.

sparse-vector-indexes.sql

create vector index sparse_vectors_ivf 
    on sparse_vectors (embedding) 
    organization neighbor partitions
    distance cosine
    with target accuracy 90;
    
ORA-51839: Vector index cannot be created on top of SPARSE vector columns.

Constructing Sparse Vectors Using Textual Input

Sparse vectors can be constructed from textual input using the vector constructor, sparse textual input and the sparse keyword after the dimension format.

to_vector( textual_input, dimension_count, dimension_format, sparse)

Vectors are dense by default, without the sparse storage format keyword, the vector constructor assumes a dense format is provided. Providing sparse textual input without specifying sparse storage format raises an exception:

construct-sparse-vector-from-text.sql

select from_vector(v) as v_serialized
from
    (
    select 
        to_vector(
            '[8,[3, 5],[3.42, 5.555]]'
            ) as v
    )
/

ORA-51833: Textual input conversion between sparse and dense vector is not supported.

Providing dense textual input while specifying sparse storage format will also raise an exception:

select from_vector(v) as v_serialized
from
    (
    select 
        to_vector(
            '[0,0,0,3.42000008E+000,0,5.55499983E+000,0,0]'
            , 8
            , float32
            , sparse
            ) as v
    )
/

ORA-51833: Textual input conversion between sparse and dense vector is not supported.

Optionally, the textual input can begin with the number of dimensions in the vector. If the dimension count is present in the textual input, the dimension count argument for the constructor can be represented with a wildcard:

select from_vector(v) as v_serialized
from
    (
    select 
        to_vector(
            '[8,[3, 5],[3.42, 5.555]]'
            , *
            , float32
            , sparse) as v
    )
/

V_SERIALIZED                                      
--------------------------------------------------
[8,[3,5],[3.42000008E+000,5.55499983E+000]]

If dimension count is not in the textual input, it must be present as an argument to the constructor.

select from_vector(v) as v_serialized
from
    (
    select 
        to_vector(
            '[[2, 4],[2.42, 4.42]]'
            , 8
            , float32
            , sparse) as v
    )
/

V_SERIALIZED                                      
--------------------------------------------------
[8,[2,4],[2.42000008E+000,4.42000008E+000]]

The dimension count must be provided in the text or as an argument to the constructor. Using sparse textual input without the dimension count and a wildcard for dimension count generates an exception:

select from_vector(v) as v_serialized
from
    (
    select 
        to_vector(
            '[[2, 4],[2.42, 4.42]]'
            , *
            , float32
            , sparse) as v
    )
/

ORA-51819: Sparse vector input is missing dimension count.

Constructing Sparse Vectors From Other Vectors

In SQL, the vector constructor supports conversion of storage formats between sparse vectors and dense vectors by constructing a new vector from a existing vector.

We can construct a sparse vector from a dense vector:

construct-sparse-vector-from-vectors.sql

select from_vector(v) as v_serialized
from
    (
    select 
        to_vector(
            to_vector(
                '[0,0,3,0,7,0,0,0]'
                , 8
                , int8
                , dense
                )
            , 8
            , int8
            , sparse
            ) as v
    )
/

V_SERIALIZED                                      
--------------------------------------------------
[8,[2,4],[3,7]]

We can also construct dense vectors from sparse vectors:

select from_vector(v) as v_serialized
from
    (
    select 
        to_vector(
            to_vector(
                '[8,[0,3,5],[11,33,77]]'
                , 8
                , int8
                , sparse
                )
            , 8
            , int8
            , dense
            ) as v
    )
/

V_SERIALIZED                                      
--------------------------------------------------
[11,0,0,33,0,77,0,0]

Constructing Sparse Vectors in PL/SQL

In PL/SQL we can construct sparse vectors using sparse textual input using the to_vector constructor and the sparse keyword. This works identically to SQL when the textual input includes the dimension count:

construct-sparse-vector-plsql.sql

declare
    v vector;
begin
    v := vector('[8,[0,3,5],[11,33,77]]',*,int8,sparse);
    dbms_output.put_line(from_vector(v));
end;
/

[8,[0,3,5],[11,33,77]]

If we omit the optional dimension count from the textual input we get an unknown processing error. This appears to be a bug in the PL/SQL implementation.

declare
    v vector;
begin
    v := to_vector('[[0,3,5],[11,33,77]]',8,int8,sparse);
    dbms_output.put_line(from_vector(v));
end;
/
ORA-51862: VECTOR library processing error in 'vectorliberr=LVECTOR_UNKNOWN_ERR/-221/0 from pevm_VEC_CONS'

At this time, constructing a vector from another vector directly in PL/SQL raises an exception. The PL/SQL implementation of the vector constructor won’t accept a constructed vector as the first argument.

declare
    v_dense vector;
    v_sparse vector;
begin
    v_dense := to_vector('[0,0,3,0,7,0,0,0]',*,int8);
    v_sparse := to_vector(v_dense,*,int8, sparse);
    dbms_output.put_line(from_vector(v_sparse));
exception
    when others then
        dbms_output.put_line(sqlerrm);
end;
/

PLS-00306: wrong number or types of arguments in call to 'TO_VECTOR'

To construct the sparse vector from an existing vector in PL/SQL we need to use SQL. The optional FROM clause simplifies this syntax, but there is still a required context switch between the PL/SQL engine and the SQL engine.

declare
    v_dense vector;
    v_sparse vector;
begin
    v_dense := to_vector('[0,0,3,0,7,0,0,0]',*,int8);
    select to_vector(v_dense,*,int8, sparse)
    into v_sparse;
    dbms_output.put_line(from_vector(v_sparse));
exception
    when others then
        dbms_output.put_line(sqlerrm);
end;
/

[8,[2,4],[3,7]]

Serializing Sparse Vectors

The default behavior for serialization is to preserve the current storage format. Using from_vector() or vector_serialize() with no format argument on a sparse vector will return a sparse textual vector:

serialize-sparse-vectors.sql

select from_vector(v) as v_serialized
from
    (
    select 
        to_vector(
            '[8,[0,3,5],[11,33,77]]'
            , 8
            , int8
            , sparse
            ) as v
    )
/

V_SERIALIZED                                      
--------------------------------------------------
[8,[2,4],[3,7]]

We can use the format clause to change the storage format of the serialized vector. In this case, the returning clause does not have to be specified unless the size of the resulting text would exceed the default varchar2 size and a clob needs to be returned.

from_vector( vector optional_returning_clause optional_format_clause )

Specifying format dense will generate a dense serialized vector, converting as needed from the input vector’s storage format if it was sparse.

select from_vector(v format dense) as v_serialized
from
    (
    select 
        to_vector(
            '[8,[0,3,5],[11,33,77]]'
            , 8
            , int8
            , sparse
            ) as v
    )
/

V_SERIALIZED                                      
--------------------------------------------------
[11,0,0,33,0,77,0,0]

Specifying format sparse will generate a sparse serialized vector, converting as needed from the input vector’s format.

select from_vector(v format sparse) as v_serialized
from
    (
    select 
        to_vector(
            '[11,0,0,33,0,77,0,0]'
            , 8
            , int8
            , dense
            ) as v
    )
/

V_SERIALIZED                                      
--------------------------------------------------
[8,[0,3,5],[11,33,77]]

Serialization in PL/SQL

Vectors can be directly serialized in PL/SQL without specifying returning or storage format clauses.

serialize-sparse-vectors-plsql.sql

declare
    v vector;
    t clob;
begin

    v := to_vector('[8,[0,3,5],[11,33,77]]',*,int8,sparse);
    t := from_vector(v);
    dbms_output.put_line(t);
    
    v := to_vector('[11,0,0,33,0,77,0,0]',*,int8);
    t := from_vector(v);
    dbms_output.put_line(t);
    
end;
/

[8,[0,3,5],[11,33,77]]
[11,0,0,33,0,77,0,0]

In PL/SQL, the from_vector or vector_serialize function does not support using any arguments for returning datatype or storage format.

declare
    v vector;
    t clob;
begin
    v := to_vector('[11,0,0,33,0,77,0,0]',*,int8);
    t := from_vector(v returning clob format sparse);
    dbms_output.put_line(t);
end;
/

PLS-00103: Encountered the symbol "FORMAT" when expecting one of the following:
   . ( ) @ % null range returning with default error using empty
   lax strict without pretty ascii true false absent allow
   truncate ignore extended

To convert the storage format with serialization in PL/SQL, we need to use SQL in order to use the storage format clause.

declare
    v vector;
    t clob;
begin
    v := to_vector('[11,0,0,33,0,77,0,0]',*,int8);
    
    select from_vector(v returning clob format sparse)
    into t;
    dbms_output.put_line('dense as sparse text: ' || t);
    
    v := to_vector('[8,[0,3,5],[11,33,77]]',*,int8, sparse);
    
    select from_vector(v returning clob format dense)
    into t;
    dbms_output.put_line('sparse as dense text: ' || t);    
end;
/

dense as sparse text: [8,[0,3,5],[11,33,77]]
sparse as dense text: [11,0,0,33,0,77,0,0]

Other Vector Functions

The vector_dimension_count (vector_dims), vector_dimension_format and vector_norm functions all work with sparse vectors as well as dense vectors. There is no difference between SQL and PL/SQL functionality with these functions.

other-vector-functions.sql

select
    vector_dimension_count(v) as dim_count
    , vector_dimension_format(v) as dim_fmt
    , vector_norm(v) as norm
from
    (
    select 
        vector('[8,[0,3,5],[11,33,77]]',*,int8,sparse) as v
    )
/

 DIM_COUNT DIM_FMT          NORM
---------- ---------- ----------
         8 INT8       8.449E+001

declare
    v vector;
begin
    v := vector('[8,[0,3,5],[11,33,77]]',*,int8,sparse);
    dbms_output.put_line('Dimension Count: ' || vector_dimension_count(v));
    dbms_output.put_line('Dimension Format: ' || vector_dimension_format(v));
    dbms_output.put_line('Vector Norm: ' || vector_norm(v));
end;
/

Dimension Count: 8
Dimension Format: INT8
Vector Norm: 8.4492603226554692E+001

Vector distances between sparse vectors can be calculated in SQL or PL/SQL with any of the available metrics applicable to non binary vectors. Because the sparse vector is conceptually a “Bag of Words” with dimensions representing tokens from the encoded text and expanded tokens that are close in meaning along with their priority weight as the dimension value, different distance metrics may be more or less useful to consider with sparse vectors.

select
    vector_distance(
        vector('[[0,3,5],[11,33,77]]', 8, int8, sparse)
        , vector('[[3,4,6],[22,15,42]]', 8, int8, sparse)
        , cosine
        ) as similarity
/

SIMILARITY
----------
8.272E-001

declare
    v1 vector;
    v2 vector;
    d binary_double;
begin
    v1 := to_vector('[8,[0,3,5],[11,33,77]]', 8, int8, sparse);
    v2 := to_vector('[8,[3,4,6],[22,15,42]]', 8, int8, sparse);

    d := vector_distance(v1, v2, cosine);
    dbms_output.put_line(d);
end;
/

8.2721506596093497E-001

Computing vector distance between sparse and dense vectors is explicitly not supported.

select
    vector_distance(
        vector('[[0,3,5],[11,33,77]]', 8, int8, sparse)
        , vector('[0,0,0,22,15,0,42,0]', 8, int8, dense)
        , cosine
        ) as similarity
/

ORA-51834: Distance computation between sparse and dense vector is not supported.

Sparse Vector Arithmetic

With dense vectors we can do arithmetic operations: addition, subtraction and multiplication. This feature is not implemented for sparse vectors.

Because the value of a sparse vector conceptually is a set of vocabulary tokens with their priority weight for the embedding, arithmetic between these dimensional values doesn’t make a lot of sense. For dense vectors, the overall vector represents a coordinate in an n-dimensional meaning space, so arithmetic is more ‘meaningful’.

As a workaround we could convert the sparse vectors to dense and do the math before converting back to sparse vectors, but this may not actually represent a valid or meaningful transformation of the vectors.

The error message about “Arithmetic operations between sparse and dense vectors” is perhaps a bit misleading in these examples because there are only sparse vectors as the arithmetic operands.

sparse-vector-arithmetic.sql

with base(v1, v2) as (
    select        
        vector('[[0,3,5],[11,33,77]]', 8, int8, sparse)
        , vector('[[3,4,6],[22,15,42]]', 8, int8, sparse)
)
select v1 + v2 as v_add
from base
/

ORA-03001: unimplemented feature
ORA-00722: Feature "Arithmetic operations between sparse and dense vectors"

with base(v1, v2) as (
    select        
        vector('[[0,3,5],[11,33,77]]', 8, int8, sparse)
        , vector('[[3,4,6],[22,15,42]]', 8, int8, sparse)
)
select v1 - v2 as v_subtract
from base
/

ORA-03001: unimplemented feature
ORA-00722: Feature "Arithmetic operations between sparse and dense vectors"

with base(v1, v2) as (
    select        
        vector('[[0,3,5],[11,33,77]]', 8, int8, sparse)
        , vector('[[3,4,6],[22,15,42]]', 8, int8, sparse)
)
select v1 * v2 as v_multiply
from base
/

ORA-03001: unimplemented feature
ORA-00722: Feature "Arithmetic operations between sparse and dense vectors"

Sparse Vector Aggregation

The aggregate functions that are valid for dense vectors (sum, avg) are not supported for sparse vectors. Like arithmetic, taking the sum or average of a set of vocabulary word priorities doesn’t make much sense.

Aggregations with sparse vectors produce the same error message as arithmetic operations.

sparse-vector-aggregation.sql

with base(v_text) as (
    values        
        ('[[0,3,5],[11,33,77]]')
        , ('[[3,4,6],[22,15,42]]')
        , ('[[2,7],[14,85]]')
), base_vectors as (
    select to_vector(v_text, 8, int8, sparse) as v
    from base
)
select avg(v) as v_avg
from base_vectors
/

ORA-03001: unimplemented feature
ORA-00722: Feature "Arithmetic operations between sparse and dense vectors"

with base(v_text) as (
    values        
        ('[[0,3,5],[11,33,77]]')
        , ('[[3,4,6],[22,15,42]]')
        , ('[[2,7],[14,85]]')
), base_vectors as (
    select to_vector(v_text, 8, int8, sparse) as v
    from base
)
select sum(v) as v_sum
from base_vectors
/

ORA-03001: unimplemented feature
ORA-00722: Feature "Arithmetic operations between sparse and dense vectors"

Zero Based Vector Dimension Indices

With the introduction of sparse vectors in 23.6 we could see the actual dimension indices for the dimensions of a vector. Originally these index values were 1-based, like a nested table or index by pls_integer associative array in PL/SQL.

As of 23.9, the dimension indexes are 0-based, like a json array. A set of vectors with a different dimension value in each vector illustrates the dimension numbering clearly when serialized to both formats for comparison.

zero-based-vector-dimension-indices.sql

with base (vector_text) as (
    values
        ('[1,0,0,0,0]'),
        ('[0,2,0,0,0]'),
        ('[0,0,3,0,0]'),
        ('[0,0,0,4,0]'),
        ('[0,0,0,0,5]')
), base_vectors as (
    select
        to_vector(vector_text, 5, int8, dense) as v
    from base
)
select
    from_vector(v returning clob format dense) as serialized_dense
    , from_vector(v returning clob format sparse) as serialized_sparse
from base_vectors
/

SERIALIZED_DENSE     SERIALIZED_SPARSE   
-------------------- --------------------
[1,0,0,0,0]          [5,[0],[1]]         
[0,2,0,0,0]          [5,[1],[2]]         
[0,0,3,0,0]          [5,[2],[3]]         
[0,0,0,4,0]          [5,[3],[4]]         
[0,0,0,0,5]          [5,[4],[5]]      

In 23.6, this same query showed that the dimension indices were 1-based:

SERIALIZED_DENSE     SERIALIZED_SPARSE   
-------------------- --------------------
[1,0,0,0,0]          [5,[1],[1]]         
[0,2,0,0,0]          [5,[2],[2]]         
[0,0,3,0,0]          [5,[3],[3]]         
[0,0,0,4,0]          [5,[4],[4]]         
[0,0,0,0,5]          [5,[5],[5]]   

Conclusion

Sparse vectors from lexical models were the precursor to dense vectors. Instead of the vector representing semantic meaning of the encoded text in an n-dimensional space, sparse vectors are based on a lexical model where each dimension is used to indicate a vocabulary word from the model with its priority in assessing the meaning of the overall vector.

Sparse vector storage for these embeddings represents a significant storage savings for models with 35,200 dimensions based on the BERT model’s vocabulary. At the same time, sparse vector embeddings can be limiting when used with specialized word usages that are not present in the model’s vocabulary.

Because most of the dimensions are 0, calculations of similarity distance can be much faster with sparse vectors. Sparse vectors may be better for some use cases because of search performance, minimal storage requirements and minimal processing requirements.

While there are workarounds based on the convertability of sparse and dense vectors, some of the limitations to the implementation of sparse vectors may be based on the conceptual differences in the models used to generate them. Using a workaround to get to an unimplemented feature may actually produce nonsensical results and should be done with caution.

Importing a LLM into the Oracle database to generate embeddings requires conversion to ONNX format using OML4PY. As we have seen in the quickstart articles for OML4PY these imported models always generate dense float32 vectors. In the next article about sparse vectors I will review generating sparse vectors in Python and inserting to the database using python-oracledb libraries.

The field of AI is rapidly changing and we will continue to see more features introduced at a fast pace in the industry. While new features are settling, the available documentation is not always adequate to provide complete understanding of available functionality. It is my hope that this article makes the available documentation and examples using Oracle sparse vectors much less ‘sparse’ than it is at present.

–Anthony Harper


Source code for this article can be found at https://github.com/gaiansentience/practicalplsql

Discussion and Comments

Leave a comment