Vector Math: Semantic Arithmetic

declare
    apple_pie vector;
    apple     vector;
    peach     vector;
    peach_pie vector;
begin
    if apple_pie - apple + peach = peach_pie then
        dbms_output.put_line('Peach pie sounds good!');
    else
        dbms_output.put_line('Apple pie again?');
    end if;
exception
    when others then
        dbms_output.put_line('Too futuristic?');
end;
/

PLS-00306: wrong number or types of arguments in call to '='

As of Oracle 23ai Release 7, we can do basic arithmetic with vectors in PL/SQL and SQL. Let’s explore what this ‘means’ to us as Oracle developers, pun intended.

Vectors are representations of meaning in an n-dimensional semantic space defined by a Large Language Model. Creating a vector embedding for a word, a sentence, or a chunk of a larger document is essentially mapping the meaning of the text into the vector space.

Vector arithmetic operations return the resulting vector when meanings are added or subtracted from one another. We can think of this as a form of semantic arithmetic. For this post I will focus on addition and subtraction of vectorized words and see where that takes us.

  1. Getting Started
  2. Vectors Cannot Be Compared For Equality
  3. Vector Comparisons Require Similarity
  4. Vector Similarity Is Measured By Vector Distance
  5. Finding Capital Cities with Vector Arithmetic
  6. Summary

Getting Started

The source code for all articles here on practicalplsql.org is available in the practicalplsql public repository on github in the scripts/articles directory organized by concept and article. You can use create.user.practicalplsql.sql to create the practicalplsql schema. Sample data for articles can be found in the scripts/examples directory organized by example name.

Scripts for this post are located in vector-math/part-2-conceptual-math. This post requires a minimum version of Oracle 23ai release 7. All scripts have been tested in OCI Autonomous databases and the Oracle 23ai Free database VirtualBox appliance where this release update is available.

The code for this article depends on the ALL_MINILM_L6_V2 LLM Mining Model begin loaded into the database schema where the code is running to create the appropriate embeddings. The Quickstart article on OML4PY has instructions on using Python to generate properly prepared LLM mining models, examples for loading models to the database, and links to downloadable models in ONNX format ready for use stored in my Google Drive.

Vectors Cannot Be Compared For Equality

In the documentation for vector arithmetic there is a graphic showing figuratively the idea behind vector arithmetic. The vectors for Rome, Italy, France and Paris are show as an extrapolation of this equation using an optimal Large Language Model:

Rome – Italy + France = Paris

I don’t know if the models available currently from Hugging Face that we can use OML4PY to load into the database should be considered ‘optimal’ at this time, but the idea here is very powerful: the result of adding and subtracting vectors is essentially the same as adding and subtracting the meanings of the terms that are used to create those vectors!

We can use SQL to quickly verify this vector equation:

1.01-vectors-are-not-equal.sql

Prompt Does Rome - Italy + France = Paris

with base as (
select 
    vector_embedding(ALL_MINILM_L6_V2 using 'Paris' as data) as paris
    , vector_embedding(ALL_MINILM_L6_V2 using 'France' as data) as France
    , vector_embedding(ALL_MINILM_L6_V2 using 'Rome' as data) as Rome
    , vector_embedding(ALL_MINILM_L6_V2 using 'Italy' as data) as Italy
), arithmetic_base as (
    select 
        rome - italy + france as equation
        , paris
    from base
)
select 
    case 
        when equation = paris then 'Rome - Italy + France = Paris' 
        else 'Rome - Italy + France <> Paris'
    end as check_equation
from arithmetic_base
/

ORA-22848: cannot use VECTOR type as comparison key

That’s right, vectors cannot be compared for equality. All AI Vector seaches can do is find vectors with the least amount of vector distance between them. A vector cannot even be compared to itself for equality. This means we cannot check the result of this vector equation with a simple equality operator.

Vector Comparisons Require Similarity

We have to rethink our approach to how we consider the results of vector arithmetic equations. A vector equation cannot be said to be equal to an answer. However, the result of a vector equation can be said to be more similar to one of several possible answers. The most similar choice is closest to the right result.

Similarity is measured by vector distance, so we can state that the vector equation has a shorter vector distance to the correct answer than it does to incorrect answers. There are many different approaches to vector distance, lets just use the default metric of cosine distance for out examples.

Rephrasing our question in terms of similarity needs to compare the vector distance between the arithmetic result vector and several choices, let’s see if London or Paris is a better answer:

Is (Rome – Italy + France) more similar to Paris or London?

Running this in SQL we see that Paris is the better answer to this question because the vector distance is smaller:

1.02-vectors-use-similarity.sql

Prompt Is (Rome - Italy + France) more similar to Paris or London?

with base as (
select 
    vector_embedding(ALL_MINILM_L6_V2 using 'Paris' as data) as paris
    , vector_embedding(ALL_MINILM_L6_V2 using 'France' as data) as France
    , vector_embedding(ALL_MINILM_L6_V2 using 'London' as data) as London
    , vector_embedding(ALL_MINILM_L6_V2 using 'Rome' as data) as Rome
    , vector_embedding(ALL_MINILM_L6_V2 using 'Italy' as data) as Italy
), arithmetic_base as (
select rome - italy + france as equation, paris, rome, london
from base
)
select 
    case 
        when vector_distance(equation, paris) < vector_distance(equation, london) then
            'Rome - Italy + France is more similar to Paris than London'
        else 'Rome - Italy + France is not more similar to Paris than London'
    end as check_equation
from arithmetic_base
/

CHECK_EQUATION                                                
--------------------------------------------------------------
Rome - Italy + France is more similar to Paris than London

We can ask the same question in PL/SQL and get to the same answer:

1.03-vector-similarity-plsql.sql

declare
    vector_paris vector;
    vector_france vector;
    vector_rome vector;
    vector_italy vector;
    vector_london vector;
    vector_equation vector;
    vector_distance_1 number;
    vector_distance_2 number;
begin
    
    select 
        vector_embedding(ALL_MINILM_L6_V2 using 'Paris' as data) as paris
        , vector_embedding(ALL_MINILM_L6_V2 using 'France' as data) as France
        , vector_embedding(ALL_MINILM_L6_V2 using 'Rome' as data) as Rome
        , vector_embedding(ALL_MINILM_L6_V2 using 'Italy' as data) as Italy
        , vector_embedding(ALL_MINILM_L6_V2 using 'London' as data) as London
    into vector_paris, vector_france, vector_rome, vector_italy, vector_london;
        
    vector_equation := vector_rome - vector_italy + vector_france;
    
    vector_distance_1 := vector_distance(vector_equation, vector_paris);
    vector_distance_2 := vector_distance(vector_equation, vector_london);
    
    if vector_distance_1 < vector_distance_2 then
        dbms_output.put_line('(Rome - Italy + France) is more similar to Paris than London');
    else
        dbms_output.put_line('Not quite what we expected');
    end if;
    
end;
/

(Rome - Italy + France) is more similar to Paris than London

Vector Similarity Is Measured By Vector Distance

Is this just an accident that the results are making sense? We can load the appropriate vectors into an associative array so that we can work with them more easily than declaring lots of separate variables. The cursor iteration controls introduced in Oracle 21 make it simple to load an associative array indexes by varchar2. We can then try all of the following examples to cross check the results we are seeing:

(Rome – Italy + France) is closer to Paris or London?

(Rome – Italy + United Kingdom) is closer to Paris or London?

(London – United Kingdom + Italy) is closer to Paris or Rome?

(London – United Kingdom + France) is closer to Paris or Rome?

For each ‘semantic equation’, we can look at the actual values of the cosine distance between the equation and the possible answers:

1.04-check-vector-arithmetic.sql

declare
    type t_vectors is table of vector index by varchar2(100);
    v t_vectors;
    cursor c is
    with base(term) as (
    values 
        ('Paris'),('Rome'),('London')
        ,('France'),('Italy'),('United Kingdom')
    )
    select 
        b.term, 
        vector_embedding(ALL_MINILM_L6_V2 using b.term as data) as embedding
    from base b;    
begin

    v := t_vectors(for r in c index r.term => r.embedding);
    
    v('equation') := v('Rome') - v('Italy') + v('France');
    dbms_output.put_line('(Rome - Italy + France)');
    dbms_output.put_line('Distance to Paris: ' || vector_distance(v('equation'), v('Paris')));
    dbms_output.put_line('Distance to London: ' || vector_distance(v('equation'), v('London')));

    v('equation') := v('Rome') - v('Italy') + v('United Kingdom');
    dbms_output.put_line('(Rome - Italy + United Kingdom)');
    dbms_output.put_line('Distance to Paris: ' || vector_distance(v('equation'), v('Paris')));
    dbms_output.put_line('Distance to London: ' || vector_distance(v('equation'), v('London')));
    
    v('equation') := v('London') - v('United Kingdom') + v('Italy');
    dbms_output.put_line('(London - United Kingdom + Italy)');
    dbms_output.put_line('Distance to Paris: ' || vector_distance(v('equation'), v('Paris')));
    dbms_output.put_line('Distance to Rome: ' || vector_distance(v('equation'), v('Rome')));
    
    v('equation') := v('London') - v('United Kingdom') + v('France');
    dbms_output.put_line('(London - United Kingdom + France)');
    dbms_output.put_line('Distance to Paris: ' || vector_distance(v('equation'), v('Paris')));
    dbms_output.put_line('Distance to Rome: ' || vector_distance(v('equation'), v('Rome')));
        
end;
/

(Rome - Italy + France)
Distance to Paris: 2.9965084512286433E-001
Distance to London: 5.9621648436293428E-001

(Rome - Italy + United Kingdom)
Distance to Paris: 5.9460183089094121E-001
Distance to London: 3.9756368514909457E-001

(London - United Kingdom + Italy)
Distance to Paris: 4.6865291500436113E-001
Distance to Rome: 4.090754894227332E-001

(London - United Kingdom + France)
Distance to Paris: 2.3190182385379965E-001
Distance to Rome: 4.8457594091850864E-001

In each case, the city with the lower cosine distance is closest to the capital of the country we have added after we subtracted the country from the capital city. A capital city minus its country looks to be equivalent to saying capital city. Adding a country to this concept is closer to the correct capital city of the country we added! It seems that the current LLM is able to solve the problem adequately.

Finding Capital Cities with Vector Arithmetic

It appears that the actual question we are asking with this complex equation is simply, ‘what is the capital city of this country?’. Let’s add the vector for a country name to the vector of the word ‘capital’ and compare to various city name vectors to see if this works.

2.01-country-plus-capital.sql

set serveroutput on;
declare
    type t_vectors is table of vector index by varchar2(100);
    v t_vectors;
    cursor c is
    with base(term) as (
    values 
        ('Paris'),('Rome'),('London')
        ,('France'),('Italy'),('United Kingdom')
        ,('capital')
    )
    select 
        b.term, 
        vector_embedding(ALL_MINILM_L6_V2 using b.term as data) as embedding
    from base b;    
    
begin

    v := t_vectors(for r in c index r.term => r.embedding);
    
    v('equation') := v('Italy') + v('capital');
    dbms_output.put_line('(Italy + capital)');
    dbms_output.put_line('Distance to Paris: ' || vector_distance(v('equation'), v('Paris')));
    dbms_output.put_line('Distance to London: ' || vector_distance(v('equation'), v('London')));
    dbms_output.put_line('Distance to Rome: ' || vector_distance(v('equation'), v('Rome')));

    v('equation') := v('United Kingdom') + v('capital');
    dbms_output.put_line('(United Kingdom + capital)');
    dbms_output.put_line('Distance to Paris: ' || vector_distance(v('equation'), v('Paris')));
    dbms_output.put_line('Distance to London: ' || vector_distance(v('equation'), v('London')));
    dbms_output.put_line('Distance to Rome: ' || vector_distance(v('equation'), v('Rome')));
    
    v('equation') := v('France') + v('capital');
    dbms_output.put_line('(France + capital)');
    dbms_output.put_line('Distance to Paris: ' || vector_distance(v('equation'), v('Paris')));
    dbms_output.put_line('Distance to London: ' || vector_distance(v('equation'), v('London')));
    dbms_output.put_line('Distance to Rome: ' || vector_distance(v('equation'), v('Rome')));
    
end;
/

(Italy + capital)
Distance to Paris: 5.0221515483701262E-001
Distance to London: 5.5103133474311905E-001
Distance to Rome: 3.6240224760576656E-001

(United Kingdom + capital)
Distance to Paris: 5.4262974078288229E-001
Distance to London: 3.6615389352489458E-001
Distance to Rome: 5.4107460014668729E-001

(France + capital)
Distance to Paris: 3.4374537856466458E-001
Distance to London: 5.263360473587797E-001
Distance to Rome: 4.2970427215207985E-001

Again, the vectors add up correctly! I was not sure that we would get this far with the currently available models. The correct capital city in each case has the smallest vector distance to the addition of the concepts for country name and ‘capital’.

Checking the vector distances for various cities against the result of adding the vectors for ‘Italy’ and ‘capital’ shows that the vector for ‘Rome’ is the closest solution to this vector equation.

2.02-italy-plus-capital.sql

set serveroutput on;
declare
    type t_vectors is table of vector index by varchar2(100);
    v t_vectors;
    
    cursor c_vectors is
    with base(term) as (
    values 
        ('Paris'),('Boston'),('Rome'),('Milan'),('Venice'),('London'),('Oxford')
        ,('Italy'),('capital')
    )
    select term, vector_embedding(all_minilm_l6_v2 using term as data) as embedding
    from base;    
begin   

    v := t_vectors(for r in c_vectors index r.term => r.embedding);
    
    v('equation') := v('Italy') + v('capital');

    dbms_output.put_line('(Italy + capital)');
    dbms_output.put_line('Vector Distance to Rome: ' || (v('equation') <=> v('Rome')));  
    dbms_output.put_line('Vector Distance to Milan: ' || (v('equation') <=> v('Milan')));
    dbms_output.put_line('Vector Distance to Venice: ' || (v('equation') <=> v('Venice')));
    dbms_output.put_line('Vector Distance to Boston: ' || (v('equation') <=> v('Boston')));
    dbms_output.put_line('Vector Distance to Oxford: ' || (v('equation') <=> v('Oxford')));        
    dbms_output.put_line('Vector Distance to London: ' || (v('equation') <=> v('London')));    
    
end;
/

(Italy + capital)
Vector Distance to Rome: 3.6240224760576656E-001
Vector Distance to Milan: 4.1047204677196603E-001
Vector Distance to Venice: 3.8740155351994054E-001
Vector Distance to Boston: 3.9790772095926996E-001
Vector Distance to Oxford: 6.1004993062506807E-001
Vector Distance to London: 5.5103133474311905E-001

Instead of printing out the vector distances, we can just compare all the distances and keep the smallest distance as representative of the correct answer:

2.03-france-plus-capital.sql

set serveroutput on;
declare
    type t_vectors is table of vector index by varchar2(100);
    v t_vectors;
    v_cities    t_vectors;
    l_distance number;
    l_distance_min number;
    l_capital_city varchar2(100);
    
    cursor c_cities is
    with base(term) as (
    values 
        ('Paris'),('Rome'),('London'),('Oxford'),('Boston'),('Berlin'),('Munich')
        ,('Lima'),('Milan'),('Venice'),('Tokyo'),('Dublin'),('Chicago'),('Washington DC')
    )
    select b.term, vector_embedding(all_minilm_l6_v2 using b.term as data) as embedding
    from base b;  
    
begin   

    select vector_embedding(all_minilm_l6_v2 using 'France' as data) into v('France');
    select vector_embedding(all_minilm_l6_v2 using 'capital' as data) into v('capital');
    
    v_cities := t_vectors(for r in c_cities index r.term => r.embedding);
    
    for city_name, city_vector in pairs of v_cities loop
        dbms_output.put_line('Checking similarity for ' || city_name);
        l_distance := vector_distance(v('France') + v('capital'), city_vector); 
        
        if l_distance_min is null then 
            l_distance_min := l_distance;
            l_capital_city := city_name;  
        elsif l_distance < l_distance_min then
            l_distance_min := l_distance;
            l_capital_city := city_name;
        end if;
    end loop;
    
    dbms_output.put_line('(France + capital) is most similar to ' || l_capital_city);
    
end;
/

Checking similarity for Berlin
Checking similarity for Boston
Checking similarity for Chicago
Checking similarity for Dublin
Checking similarity for Lima
Checking similarity for London
Checking similarity for Milan
Checking similarity for Munich
Checking similarity for Oxford
Checking similarity for Paris
Checking similarity for Rome
Checking similarity for Tokyo
Checking similarity for Venice
Checking similarity for Washington DC

(France + capital) is most similar to Paris

For our simple example, the vector of the correct capital city always has the closest vector distance to the vector result of the equation.

Our simple PL/SQL program is looking like a geography whiz with one country at a time…let’s see if it can match a set of countries with the appropriate capital city using vector arithmetic:

2.04-geography-test.sql

set serveroutput on;
declare
    type t_vectors is table of vector index by varchar2(100);
    l_distance number;
    l_city varchar2(100);
    l_distance_min number;
    l_capital_city varchar2(100);
    v_cities    t_vectors;
    v_countries t_vectors;
    v_terms     t_vectors;
    
    cursor c_cities is
    with base(term) as (
    values 
        ('Paris'),('Rome'),('London'),('Oxford'),('Boston'),('Berlin'),('Munich')
        ,('Lima'),('Milan'),('Venice'),('Tokyo'),('Dublin'),('Chicago'),('Washington DC')
    )
    select b.term, vector_embedding(all_minilm_l6_v2 using b.term as data) as embedding
    from base b;  
    
    cursor c_countries is
    with base(term) as (
    values 
        ('France'),('Italy'),('Germany'),('Peru'),('Japan'),('United States')
        ,('United Kingdom'),('Great Britain'),('England')
    )
    select b.term, vector_embedding(all_minilm_l6_v2 using b.term as data) as embedding
    from base b; 
    
    cursor c_terms is
    with base(term) as (
    values ('capital'),('city'),('country')
    )
    select b.term, vector_embedding(all_minilm_l6_v2 using b.term as data) as embedding
    from base b;     
begin   
    
    v_terms := t_vectors(for r in c_terms index r.term => r.embedding);
    v_countries := t_vectors(for r in c_countries index r.term => r.embedding);
    v_cities := t_vectors(for r in c_cities index r.term => r.embedding);
    
    for country_name, country_vector in pairs of v_countries loop
        
        for city_name, city_vector in pairs of v_cities loop
            l_distance := country_vector + v_terms('capital') <=> city_vector;
            if l_distance_min is null then
                l_distance_min := l_distance;
                l_capital_city := city_name;
            elsif l_distance < l_distance_min then
                l_distance_min := l_distance;
                l_capital_city := city_name;
            end if;                
        end loop;
        
        dbms_output.put_line('(' || country_name || ' + capital) is ' || l_capital_city);
        
        --reset for the next country
        l_distance_min := null;
        l_capital_city := null;
    end loop;
    
end;
/

(England + capital) is Boston
(France + capital) is Paris
(Germany + capital) is Berlin
(Great Britain + capital) is London
(Italy + capital) is Rome
(Japan + capital) is Tokyo
(Peru + capital) is Lima
(United Kingdom + capital) is London
(United States + capital) is Washington DC

Well, Boston is not really the capital of England, but our LLM did pretty well in this test! The training for this model must not have included the fact that London is the capital of England, in addition to being the capital of Great Britain and the United Kingdom. We now know that simple vector arithmetic is possible, and that it will not always be perfect. Perhaps that is why the documentation specifies that this calculation is correct with an ‘optimal’ LLM.

Summary

Semantic arithmetic is now supported in the database…thinking about these examples sheds a bit of light on how vector embeddings are created for larger pieces of text. The sum of all vectors in a document chunk is roughly equivalent to the vector embedding that we get for that chunk. Negation of a word in a sentence is like subtracting the meaning of that word from the sentence vector.

We are still at the beginning of a new semantic calculus, but these small experiments in vector arithmetic are definitely pointing to a future that is coming faster than we might have expected.

In closing, lets rethink the title example using similarity instead of equality as a measure of the correctness of our vector arithmetic:

3.01-pie-example-reconsidered.sql

declare
    type t_vectors_nt is table of vector index by varchar2(100);
    v t_vectors_nt;
    
    cursor c is
    with base(term) as (
        values ('apple pie'),('peach pie'),('apple'),('peach')
    )
    select 
        b.term
        , vector_embedding(
            dev_vector.ALL_MINILM_L6_V2 using b.term as data
            ) as embedding
    from base b;
begin

    v := t_vectors_nt(for r in c index r.term => r.embedding);
    
    v('compromise') := v('apple pie') - v('apple') + v('peach');
        
    if vector_distance(v('compromise'), v('peach pie')) 
        < vector_distance(v('compromise'), v('apple pie')) then
        dbms_output.put_line('Peach pie sounds good!');
    else
        dbms_output.put_line('Peach pie is not a substitute for apple pie!');
    end if;
end;
/

Peach pie sounds good!

The basics of vector arithmetic open vast horizons of possibilities in this new world of actual word problems that can be ‘solved’ using semantic arithmetic with vectors. For my part, I will stick to exploring the mechanics of working with the vector datatype within the Oracle database and see where it takes me.

–Anthony Harper

Discussion and Comments

Leave a comment