Table Data Comparison: Row Differences

declare
l_task varchar2(100);
l_goal varchar2(100);
begin
l_task := 'select rows from source and target that have differences';
l_goal := 'simplify this task by using pl/sql';

dbms_output.put_line(l_task);
dbms_output.put_line(l_goal);
end;
/
  1. Getting Started
  2. Example Data
  3. Set Operators
  4. SQL Table Macros
  5. Set Operators and SQL Macros
  6. Full Outer Joins
  7. Full Outer Joins and SQL Macros
  8. Viewing Generated Macro SQL With DBMS_OUTPUT
  9. Full Outer Join Using JSON
  10. Full Outer Join Using JSON and SQL Macro
  11. Row Comparisons: Real World Cases
  12. Column Comparisons

A common task for database developers is comparing data between two sources to find the rows with differences. This might be a table loaded with new data that must be compared against existing rows to see which rows need to be loaded. If we rewrite an existing view, the output of the previous version needs to be compared with the output of the rewritten view to confirm that it is still functionally the same. Sometimes we just want to compare two SQL statements to see whether they produce identical output.

There are a few standard ways to find the differences between two data sources. Chris Saxon has an excellent article outlining the methods, with the pros and cons of each approach: https://blogs.oracle.com/sql/post/how-to-compare-two-tables-to-get-the-different-rows-with-sql. Chris also shows how to convert the GROUP BY method into a reusable SQL macro.

In this article I want to take a closer look at the other three methods of row comparison and work through creating SQL table macros for each of them. The second part of this article then takes a deeper dive into breaking row comparisons into column comparisons.

Getting Started

The source code for all articles here on practicalplsql.org is available in the practicalplsql public repository on GitHub in the scripts directory. You can use scripts/create.user.practicalplsql.sql to create the practicalplsql schema. Sample data for articles can be found in the scripts/examples directory, organized by example name.

Scripts for this post are located in scripts/articles/table-data-compare/row-differences. For the examples I will be using a simple table populated with products, replicated to two tables with some differences. This example is located at scripts/examples/table-data-compare. The example can be created by running the create_example.sql script in SQL*Plus or SQL Developer. All examples have been tested in SQL Developer. All scripts should be executed with serveroutput on; this setting is omitted from the listings for clarity.

Some queries and code blocks build on previous examples and omit the repeated code for brevity; the source code has no omissions. Finally, when several scripts produce the same correct output, the output is shown only once to avoid repetition.

Example Data

This is the data being compared for all examples between the products_source and products_target tables. I have highlighted the differences with brackets, but even with this small dataset it is difficult to see all of the differences.

select product_id, code, name, description, style, msrp from products_source

PRODUCT_ID CODE  NAME                DESCRIPTION               STYLE        MSRP
---------- ----- ------------------- ------------------------- ------------ ----
         1 P-ES  Everest Summit      [Mt.] Everest Summit      18x20          30
         2 P-EB  Everest Basecamp    [Mt.] Everest Basecamp    18x20          30
         3 P-FD  [Fujiyama] Dawn     Mount Fuji at dawn        11x17        [20]
         4 P-FS  [Fujiyama] Sunset   Mount Fuji at sunset      11x17          20
         5 P-K2  K2 Summit           K2 summit                 11x17          20
         6 PC-ES Everest Postcards   [Mt.] Everest postcards   [5x7]           9
         7 PC-FJ Fuji Postcards      Mount Fuji postcards      <<null>>        9
         8 PC-K2 K2 Postcards        K2 postcards              [Color]         9
[        9 PC-S  Shasta Postcards    Mount Shasta postcards    5x7             9]

select product_id, code, name, description, style, msrp from products_target

PRODUCT_ID CODE  NAME                DESCRIPTION               STYLE        MSRP
---------- ----- ------------------- ------------------------- ------------ ----
         1 P-ES  Everest Summit      [Mount] Everest Summit    18x20          30
         2 P-EB  Everest Basecamp    [Mount] Everest Basecamp  18x20          30
         3 P-FD  [Fuji] Dawn         Mount Fuji at dawn        11x17        [19]
         4 P-FS  [Fuji] Sunset       Mount Fuji at sunset      11x17          20
         5 P-K2  K2 Summit           K2 summit                 11x17          20
         6 PC-ES Everest Postcards   [Mount] Everest postcards [Monochrome]    9
         7 PC-FJ Fuji Postcards      Mount Fuji postcards      <<null>>        9
         8 PC-K2 K2 Postcards        K2 postcards              [<<null>>]      9

Note: Brackets Added To Highlight Differences

Set Operators

The classic way to compare rows between two tables, views, or queries uses set operators. MINUS gives us all rows from A that are not in B; reversing the operands returns all rows in B that are not in A. Rows identical in both sources appear in neither set. UNION ALL then combines the differences from both directions.

Logically this can be expressed as:

(a minus b) union all (b minus a)

Note: This approach does not detect duplicate rows within A or B. Adapting the query to use MINUS ALL would show cases where the duplicate counts differ between the two sources, but it would still not identify duplicates present in equal numbers in both. In this article I am working with two tables that have a primary key, so duplicates are not possible within a row source.
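To illustrate the multiset semantics outside the database, here is a small Python sketch (illustrative only, not Oracle code): plain set difference behaves like MINUS on distinct rows, while Counter subtraction behaves like MINUS ALL, surfacing surplus duplicates but cancelling equal duplicate counts.

```python
from collections import Counter

# Rows as tuples: ('x',) twice in both sides, ('y',) twice in a but once in b.
a = [('x',), ('x',), ('y',), ('y',), ('z',)]
b = [('x',), ('x',), ('y',)]

# MINUS-style: distinct difference, duplicate counts are invisible.
set_diff = set(a) - set(b)

# MINUS ALL-style: multiset difference keeps the surplus copies.
multiset_diff = Counter(a) - Counter(b)

print(set_diff)             # {('z',)}
print(dict(multiset_diff))  # {('y',): 1, ('z',): 1}
```

Note how ('x',), present twice on both sides, disappears from both results: equal duplicate counts cannot be detected either way.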

A flag needs to be added to each MINUS query to record the row source: whether the row was in A but not in B, or in B but not in A.

Phrasing this logic in a query we get the following SQL statement:

select u.*
from (
  (
    select 'source' as row_source, s.* from products_source s
    minus
    select 'source' as row_source, t.* from products_target t
  )
  union all
  (
    select 'target' as row_source, t.* from products_target t
    minus
    select 'target' as row_source, s.* from products_source s
  )
) u
order by u.product_id, u.row_source;

Ordering by the primary key and the row_source allows us to see the differences clearly:

ROW_SOURCE PRODUCT_ID CODE  NAME                DESCRIPTION               STYLE        MSRP
---------- ---------- ----- ------------------- ------------------------- ------------ ----
source              1 P-ES  Everest Summit      [Mt.] Everest Summit      18x20          30
target              1 P-ES  Everest Summit      [Mount] Everest Summit    18x20          30
source              2 P-EB  Everest Basecamp    [Mt.] Everest Basecamp    18x20          30
target              2 P-EB  Everest Basecamp    [Mount] Everest Basecamp  18x20          30
source              3 P-FD  [Fujiyama] Dawn     Mount Fuji at dawn        11x17        [20]
target              3 P-FD  [Fuji] Dawn         Mount Fuji at dawn        11x17        [19]
source              4 P-FS  [Fujiyama] Sunset   Mount Fuji at sunset      11x17          20
target              4 P-FS  [Fuji] Sunset       Mount Fuji at sunset      11x17          20
source              6 PC-ES Everest Postcards   [Mt.] Everest postcards   [5x7]           9
target              6 PC-ES Everest Postcards   [Mount] Everest postcards [Monochrome]    9
source              8 PC-K2 K2 Postcards        K2 postcards              [Color]         9
target              8 PC-K2 K2 Postcards        K2 postcards              [<<null>>]      9
source     [        9 PC-S  Shasta Postcards    Mount Shasta postcards    5x7             9]

13 rows selected

This approach clearly shows the rows with differences, including the rows where the difference is a null column in one table and a populated column in the other.

The obvious drawback to this approach is that it requires two full table scans of each table:

---------------------------------------------------
| Id | Operation                | Name            |
---------------------------------------------------
|  0 | SELECT STATEMENT         |                 |
|  1 |  SORT ORDER BY           |                 |
|  2 |   VIEW                   |                 |
|  3 |    UNION-ALL             |                 |
|  4 |     MINUS HASH           |                 |
|  5 |      TABLE ACCESS FULL   | PRODUCTS_SOURCE |
|  6 |      TABLE ACCESS FULL   | PRODUCTS_TARGET |
|  7 |     MINUS HASH           |                 |
|  8 |      TABLE ACCESS FULL   | PRODUCTS_TARGET |
|  9 |      TABLE ACCESS FULL   | PRODUCTS_SOURCE |
---------------------------------------------------

We can avoid this by using CTEs so that the rows from each table are materialized before the comparison logic is applied:

with source_table as (
  select /*+ materialize */ s.* from products_source s
), target_table as (
  select /*+ materialize */ t.* from products_target t
)
select u.*
from (
  (
    select 'source' as row_source, s.* from source_table s
    minus
    select 'source' as row_source, t.* from target_table t
  )
  union all
  (
    select 'target' as row_source, t.* from target_table t
    minus
    select 'target' as row_source, s.* from source_table s
  )
) u
order by u.product_id, u.row_source;

Using this approach there is only one full table scan of each table to support the comparison.

--------------------------------------------------------------------------------
| Id | Operation                                  | Name                       |
--------------------------------------------------------------------------------
|  0 | SELECT STATEMENT                           |                            |
|  1 |  TEMP TABLE TRANSFORMATION                 |                            |
|  2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)  | SYS_TEMP_0FD9D66AD_E134873 |
|  3 |    TABLE ACCESS FULL                       | PRODUCTS_SOURCE            |
|  4 |   LOAD AS SELECT (CURSOR DURATION MEMORY)  | SYS_TEMP_0FD9D66AE_E134873 |
|  5 |    TABLE ACCESS FULL                       | PRODUCTS_TARGET            |
|  6 |   SORT ORDER BY                            |                            |
|  7 |    VIEW                                    |                            |
|  8 |     UNION-ALL                              |                            |
|  9 |      MINUS HASH                            |                            |
| 10 |       VIEW                                 |                            |
| 11 |        TABLE ACCESS FULL                   | SYS_TEMP_0FD9D66AD_E134873 |
| 12 |       VIEW                                 |                            |
| 13 |        TABLE ACCESS FULL                   | SYS_TEMP_0FD9D66AE_E134873 |
| 14 |      MINUS HASH                            |                            |
| 15 |       VIEW                                 |                            |
| 16 |        TABLE ACCESS FULL                   | SYS_TEMP_0FD9D66AE_E134873 |
| 17 |       VIEW                                 |                            |
| 18 |        TABLE ACCESS FULL                   | SYS_TEMP_0FD9D66AD_E134873 |
--------------------------------------------------------------------------------

Using the CTE approach also allows us to apply filter predicates to each table, limit the columns being compared, and apply aliases if the tables have different column names.

Once we have switched to the CTE approach, the comparison part of the query is always the same. To compare two row sets we simply define the source_table and target_table expressions and then append our standard comparison query.

SQL Table Macros

The introduction of SQL table macros in Oracle 21c allows standard query patterns to be reused with a simplified calling syntax. A macro is a PL/SQL function carrying the sql_macro annotation that returns the text of a SQL statement, so the return type must be a string: either varchar2 or clob.

SQL macros are invoked at parse time, and the query is transformed into the pattern defined by the returned text before the optimizer builds the execution plan. The biggest benefit is that there is no context switch during query execution.

Note: Oracle 21c has both table macros and scalar macros, so the annotation can specify (table) or (scalar); table is the default. Only table macros were backported to Oracle 19c in Release Update 7, so in 19c the annotation cannot take the (table) argument.

The simplest form of a macro just returns static SQL:

create or replace function row_count_static
return varchar2 sql_macro(table)
is
begin
  return 'select count(*) as row_count from products_source';
end row_count_static;

--In 19c the annotation does not have the (table) parameter:
create or replace function row_count_static
return varchar2 sql_macro

To use the macro we just select from the function:

select * from row_count_static()

To make this more reusable, it would be better to pass the table we want to query. This is done by using the dbms_tf.table_t type that was introduced for polymorphic table functions to pass in the identifier of a table.

create or replace function row_count_dynamic(
  p_table in dbms_tf.table_t)
return varchar2
sql_macro(table)
is
begin
  return 'select count(*) as row_count from p_table';
end row_count_dynamic;

Now the macro can be used for any table or CTE expression:

select * from row_count_dynamic(products_target);

select * from row_count_dynamic(user_objects);

with base as (
  select * from products_source
  union all
  select * from products_target
)
select * from row_count_dynamic(base);

To pass one or more columns to the macro, we can use dbms_tf.columns_t to pass a collection of column identifiers. To construct macros for table comparisons, it is useful to be able to include or exclude columns, or to identify columns as join keys in the resulting SQL statement.

Unlike table parameters declared as table_t, columns_t parameters cannot be referenced by parameter name directly inside the SQL string. We need to read the elements of the columns collection and concatenate them into the SQL text. It is more readable to build up any expressions that refer to columns and then substitute them for placeholder tokens in the statement.
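The token-replacement pattern itself is plain string templating. As a language-neutral illustration (a Python sketch with hypothetical names, not the macro code), we build one fragment per column, concatenate the fragments, and substitute the result for a placeholder in the statement text:

```python
# Hypothetical column list standing in for the contents of a columns_t parameter.
columns = ['PRODUCT_ID', 'CODE', 'NAME']

# Build one coalesce expression per column, then join the fragments.
coalesce_list = '\n, '.join(
    f'coalesce(s.{col}, t.{col}) as {col}' for col in columns
)

# Swap the assembled fragment into the statement template.
template = 'select\nrow_source\n, ##COALESCE_COLUMNS##\nfrom ...'
sql = template.replace('##COALESCE_COLUMNS##', coalesce_list)
print(sql)
```

The PL/SQL macros later in this article follow exactly this shape, with loops and replace calls instead of a join expression.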

Set Operators and SQL Macros

Since we know the CTE approach is more efficient, we could create a macro that emits a WITH clause, but such a macro could not be called from queries that already use a WITH clause. A better approach is to write the macro with the simpler syntax and pass CTEs as parameters, keeping the possibility of performance optimizations from reducing the number of full table scans. Note: the materialize hint is undocumented, and it will not always force these CTEs to be materialized.

To optionally order the results by a key column or unique identifier, we can add a columns parameter with a null default.

create or replace function row_compare_union(
  p_source in dbms_tf.table_t,
  p_target in dbms_tf.table_t,
  p_order_columns in dbms_tf.columns_t default null
) return varchar2
sql_macro $if dbms_db_version.version >= 21 $then (table) $end
is
  l_sql varchar2(32000);
begin

  --the table_t parameters are referenced by name inside the statement text
  l_sql :=
q'[
select u.*
from (
  (
    select 'source' as row_source, s.*
    from p_source s
    minus
    select 'source' as row_source, t.*
    from p_target t
  )
  union all
  (
    select 'target' as row_source, t.*
    from p_target t
    minus
    select 'target' as row_source, s.*
    from p_source s
  )
) u
]';

  if p_order_columns is not null then
    l_sql := l_sql || ' order by ';
    for i in 1..p_order_columns.count loop
      l_sql := l_sql || 'u.' || p_order_columns(i) || ',';
    end loop;
    l_sql := l_sql || ' u.row_source';
  end if;

  dbms_output.put_line(l_sql);
  return l_sql;

end row_compare_union;

Now we can easily run the row comparisons without actually writing the query.

select * from row_compare_union(products_source, products_target, columns(product_id));

select * from row_compare_union(products_source, products_target, columns(code));

select * from row_compare_union(products_source, products_target);

with source_rows as (
select * from products_source
), target_rows as (
select * from products_target
)
select * from row_compare_union(source_rows, target_rows, columns(product_id));

This returns the same results as writing out the complete query, with significantly simplified syntax.

The set operator approach is still limited because it cannot handle clob columns unless they are converted to a hash value.
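The usual workaround is to compare a fixed-length hash of each clob (for example with standard_hash) instead of the clob itself. The idea, sketched here in Python with hashlib rather than Oracle code:

```python
import hashlib

def digest(text: str) -> str:
    """Fixed-length fingerprint of an arbitrarily long value."""
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

long_a = 'chapter text ' * 10_000
long_b = 'chapter text ' * 10_000
long_c = long_a + '!'

# Equal content gives equal digests; any difference changes the digest.
print(digest(long_a) == digest(long_b))  # True
print(digest(long_a) == digest(long_c))  # False
```

The set comparison then operates on short strings, at the cost of a negligible collision risk.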

Full Outer Joins

We can also compare rows using a full outer join on all columns. This avoids the two full table scans per table that we encountered with the set operator approach. In its simplest form, however, this approach cannot handle nulls correctly, and clob columns cannot be compared.

select 
coalesce(s.row_source, t.row_source) as row_source
, coalesce(s.product_id, t.product_id) as product_id
, coalesce(s.code, t.code) as code
, coalesce(s.name, t.name) as name
, coalesce(s.description, t.description) as description
, coalesce(s.style, t.style) as style
, coalesce(s.msrp, t.msrp) as msrp
from
(
select 'source' as row_source, s.* from products_source s
) s
full outer join (
select 'target' as row_source, t.* from products_target t
) t
on s.product_id = t.product_id
and s.code = t.code
and s.name = t.name
and s.description = t.description
and s.style = t.style
and s.msrp = t.msrp
where s.product_id is null or t.product_id is null
order by product_id, row_source

This approach is limited because the equality joins are not null aware: identical rows with a null value in the same column show up as differences. The true differences are still displayed correctly, but matching rows with a null column are also reported. Row 7 is identical in both tables, yet the null in the style column makes it appear as a difference.

source              7 PC-FJ Fuji Postcards      Mount Fuji postcards      <<null>>        9
target              7 PC-FJ Fuji Postcards      Mount Fuji postcards      <<null>>        9
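The root cause is SQL's three-valued equality: a comparison involving NULL is never true. A small Python sketch of the two comparison semantics (None standing in for NULL; illustrative, not Oracle code):

```python
def sql_equals(a, b):
    """Plain SQL equality: a comparison involving NULL never matches."""
    if a is None or b is None:
        return False
    return a == b

def decode_equals(a, b):
    """decode(a, b, 1, 0) = 1 semantics: two NULLs count as a match."""
    if a is None and b is None:
        return True
    return sql_equals(a, b)

# Row 7's style column is null in both tables.
print(sql_equals(None, None))     # False: the rows fail to join
print(decode_equals(None, None))  # True: the rows match as intended
```

The decode-based join shown later in this section applies exactly the second rule.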

Instead of writing out all of the columns in the join and coalescing all of their values, we can also use a full natural join to further simplify the syntax:

select *
from
(
select 'source' as row_source_s, s.*
from products_source s
) s
natural full outer join (
select 'target' as row_source_t, t.*
from products_target t
) t
where row_source_s is null or row_source_t is null
order by product_id

The full natural join approach looks simple, but it has the same issue with null columns and special handling for clobs cannot be introduced.

Using a full outer join with the columns and join predicates written out allows us to make the comparison null aware by using decode or a similar technique. This approach can also support clob columns by using dbms_lob.compare.

select 
coalesce(s.row_source, t.row_source) as row_source
, coalesce(s.product_id, t.product_id) as product_id
, coalesce(s.code, t.code) as code
, coalesce(s.name, t.name) as name
, coalesce(s.description, t.description) as description
, coalesce(s.style, t.style) as style
, coalesce(s.msrp, t.msrp) as msrp
from
(
select 'source' as row_source, s.* from products_source s
) s
full outer join (
select 'target' as row_source, t.* from products_target t
) t
on s.product_id = t.product_id
and decode(s.code, t.code, 1, 0) = 1
and decode(s.name, t.name, 1, 0) = 1
and decode(s.description, t.description, 1, 0) = 1
and decode(s.style, t.style, 1, 0) = 1
and decode(s.msrp, t.msrp, 1, 0) = 1
--if a column were a clob, join with dbms_lob.compare (it returns 0 when the lobs match)
--and dbms_lob.compare(s.clob_col, t.clob_col) = 0
where s.product_id is null or t.product_id is null
order by product_id, row_source

Using decode in the join condition gets rid of the false differences caused by a null column in both tables, and the differences are now displayed correctly. If there were clob columns, they would be handled as well.

Full Outer Joins and SQL Macros

Specifying all of the select columns and the join columns can be handled dynamically if we use a SQL table macro function. We can use the columns collection from the dbms_tf.table_t input parameter to do this easily.

The column information here also includes the column datatype as an Oracle typecode, allowing clob columns to be compared with alternate join syntax.

A parameter of dbms_tf.columns_t can be used to specify the id column for filtering records that don’t satisfy the full outer join, and for sorting the output.

We can also pay attention to the formatting of the generated SQL so that the output is indented nicely. The dbms_output call prints the statement on hard parse when the macro is executed from a script. Using substitution tokens also keeps the main body of the SQL statement readable.

Here is the macro function that accounts for identical rows with null columns by using decode, and accounts for clob columns by using dbms_lob.compare:

create or replace function row_compare_full_join_decode(
  p_source in dbms_tf.table_t,
  p_target in dbms_tf.table_t,
  p_id_column in dbms_tf.columns_t
) return varchar2
sql_macro $if dbms_db_version.version >= 21 $then (table) $end
is
  c_lf constant varchar2(1) := chr(10);
  l_sql varchar2(32000);
  l_column dbms_tf.column_metadata_t;
  l_coalesce varchar2(4000);
  l_joins varchar2(4000);
begin

  --coalesce each column from both sides for the select list
  for i in 1..p_source.column.count loop
    l_column := p_source.column(i).description;
    l_coalesce := l_coalesce
      || case when i > 1 then ' ' end
      || ', coalesce(s.' || l_column.name
      || ', t.' || l_column.name
      || ') as ' || l_column.name || c_lf;
  end loop;
  l_coalesce := trim(trailing c_lf from l_coalesce);

  --id columns join on equality, clobs join with dbms_lob.compare
  --(which returns 0 on a match), all other columns join with null-aware decode
  for i in 1..p_source.column.count loop
    l_column := p_source.column(i).description;
    l_joins := l_joins
      || case when i > 1 then ' and ' end
      || case
           when l_column.name member of p_id_column then
             's.' || l_column.name || ' = t.' || l_column.name
           when l_column.type = dbms_tf.type_clob then
             'dbms_lob.compare(s.' || l_column.name || ', t.' || l_column.name || ') = 0'
           else
             'decode(s.' || l_column.name || ', t.' || l_column.name || ', 1, 0) = 1'
         end || c_lf;
  end loop;
  l_joins := trim(trailing c_lf from l_joins);

  l_sql :=
q'[
select
coalesce(s.row_source, t.row_source) as row_source
##COALESCE_COLUMNS##
from
(
select 'source' as row_source, src.* from p_source src
) s
full outer join (
select 'target' as row_source, tgt.* from p_target tgt
) t
on ##JOIN_COLUMNS##
where
s.##ID_COLUMN## is null
or t.##ID_COLUMN## is null
order by ##ID_COLUMN##, row_source
]';

  l_sql := replace(l_sql, '##COALESCE_COLUMNS##', l_coalesce);
  l_sql := replace(l_sql, '##ID_COLUMN##', p_id_column(1));
  l_sql := replace(l_sql, '##JOIN_COLUMNS##', l_joins);

  dbms_output.put_line(l_sql);

  return l_sql;

end row_compare_full_join_decode;

This approach will handle clobs and null columns in identical rows correctly.

Viewing Generated Macro SQL With DBMS_OUTPUT

Printing the SQL variable with dbms_output provides a quick way to see the generated statement. Run the macro in a PL/SQL block:

set serveroutput on;
declare
  l_sql varchar2(4000);
  i number;
begin
  l_sql :=
  q'[
  select count(*) as differences
  from row_compare_full_join_decode(products_source, products_target, columns(product_id))
  ]';

  execute immediate l_sql into i;
  dbms_output.put_line('function is called on hard parse, sql statement prints');
  execute immediate l_sql into i;
  dbms_output.put_line('no hard parse, function is not called, statement does not print');
end;
/

This also shows that the macro function is only called on a hard parse. The second execution of the statement is already parsed and does not call the function:

select 
coalesce(s.row_source, t.row_source) as row_source
, coalesce(s."PRODUCT_ID", t."PRODUCT_ID") as "PRODUCT_ID"
, coalesce(s."CODE", t."CODE") as "CODE"
, coalesce(s."NAME", t."NAME") as "NAME"
, coalesce(s."DESCRIPTION", t."DESCRIPTION") as "DESCRIPTION"
, coalesce(s."STYLE", t."STYLE") as "STYLE"
, coalesce(s."MSRP", t."MSRP") as "MSRP"
from
(
select 'source' as row_source, src.* from p_source src
) s
full outer join (
select 'target' as row_source, tgt.* from p_target tgt
) t
on s."PRODUCT_ID" = t."PRODUCT_ID"
and decode(s."CODE", t."CODE", 1, 0) = 1
and decode(s."NAME", t."NAME", 1, 0) = 1
and decode(s."DESCRIPTION", t."DESCRIPTION", 1, 0) = 1
and decode(s."STYLE", t."STYLE", 1, 0) = 1
and decode(s."MSRP", t."MSRP", 1, 0) = 1
where
s."PRODUCT_ID" is null or t."PRODUCT_ID" is null
order by "PRODUCT_ID", row_source

function is called on hard parse, sql statement prints
no hard parse, function is not called, statement does not print

PL/SQL procedure successfully completed.

Full Outer Join Using JSON

The limitations of the full outer join approach can be solved more simply by converting each row to JSON and using JSON_EQUAL as the join predicate. The appeal of this approach is that JSON handles clob columns and compares nulls with no further manipulation. Note: in 19c we return the JSON as a clob; in 21c we can return it as json in native binary format to further optimize performance.

One or more columns can also be added alongside the json_equal predicate to improve join performance.
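To see why the serialized comparison is null-safe, here is the idea in Python (dicts standing in for rows; illustrative, not Oracle code): a null column serializes as JSON null and compares like any other value, so one equality check covers the whole row.

```python
import json

row_source = {'PRODUCT_ID': 7, 'CODE': 'PC-FJ', 'STYLE': None, 'MSRP': 9}
row_target = {'PRODUCT_ID': 7, 'CODE': 'PC-FJ', 'STYLE': None, 'MSRP': 9}

def same_row(a, b):
    # One null-safe comparison over every column at once.
    return json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)

print(same_row(row_source, row_target))                       # True
print(same_row(row_source, dict(row_target, STYLE='Color')))  # False
```

In the database, json_equal performs this comparison directly on the serialized row documents.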

select 
coalesce(s.row_source, t.row_source) as row_source,
coalesce(s.jdoc, t.jdoc) as jdoc
from
(
select
'source' as row_source
, product_id
, code
, json_object(* returning clob) as jdoc
from products_source
) s
full outer join (
select
'target' as row_source
, product_id
, code
, json_object(* returning clob) as jdoc
from products_target
) t
on s.product_id = t.product_id
and s.code = t.code
and json_equal(s.jdoc, t.jdoc)
where s.product_id is null or t.product_id is null
order by coalesce(s.product_id, t.product_id), row_source

The output in this case will format each row as a JSON object, and all rows with differences are correctly identified:

ROW_SOURCE JDOC                                                                                                                              
---------- ----------------------------------------------------------------------------------------------------------------------------------
source {"PRODUCT_ID":1,"CODE":"P-ES","NAME":"Everest Summit","DESCRIPTION":"Mt. Everest Summit","STYLE":"18x20","MSRP":30}
target {"PRODUCT_ID":1,"CODE":"P-ES","NAME":"Everest Summit","DESCRIPTION":"Mount Everest Summit","STYLE":"18x20","MSRP":30}
source {"PRODUCT_ID":2,"CODE":"P-EB","NAME":"Everest Basecamp","DESCRIPTION":"Mt. Everest Basecamp","STYLE":"18x20","MSRP":30}
target {"PRODUCT_ID":2,"CODE":"P-EB","NAME":"Everest Basecamp","DESCRIPTION":"Mount Everest Basecamp","STYLE":"18x20","MSRP":30}
source {"PRODUCT_ID":3,"CODE":"P-FD","NAME":"Fujiyama Dawn","DESCRIPTION":"Mount Fuji at dawn","STYLE":"11x17","MSRP":20}
target {"PRODUCT_ID":3,"CODE":"P-FD","NAME":"Fuji Dawn","DESCRIPTION":"Mount Fuji at dawn","STYLE":"11x17","MSRP":19}
source {"PRODUCT_ID":4,"CODE":"P-FS","NAME":"Fujiyama Sunset","DESCRIPTION":"Mount Fuji at sunset","STYLE":"11x17","MSRP":20}
target {"PRODUCT_ID":4,"CODE":"P-FS","NAME":"Fuji Sunset","DESCRIPTION":"Mount Fuji at sunset","STYLE":"11x17","MSRP":20}
source {"PRODUCT_ID":6,"CODE":"PC-ES","NAME":"Everest Postcards","DESCRIPTION":"Mt. Everest postcards","STYLE":"5x7","MSRP":9}
target {"PRODUCT_ID":6,"CODE":"PC-ES","NAME":"Everest Postcards","DESCRIPTION":"Mount Everest postcards","STYLE":"Monochrome","MSRP":9}
source {"PRODUCT_ID":8,"CODE":"PC-K2","NAME":"K2 Postcards","DESCRIPTION":"K2 postcards","STYLE":"Color","MSRP":9}
target {"PRODUCT_ID":8,"CODE":"PC-K2","NAME":"K2 Postcards","DESCRIPTION":"K2 postcards","STYLE":null,"MSRP":9}
source {"PRODUCT_ID":9,"CODE":"PC-S","NAME":"Shasta Postcards","DESCRIPTION":"Mount Shasta postcards","STYLE":"5x7","MSRP":9}

13 rows selected.

It is harder to pick out the differences when the row is formatted as JSON. With a little work we can convert back to a columnar output with json_value or json_table:

select base.row_source, base.product_id, j.*
from
(
select
coalesce(s.row_source, t.row_source) as row_source,
coalesce(s.product_id, t.product_id) as product_id,
coalesce(s.jdoc, t.jdoc) as jdoc
from
(
select
'source' as row_source
, product_id
, code
, json_object(* returning clob) as jdoc
from products_source
) s
full outer join (
select
'target' as row_source
, product_id
, code
, json_object(* returning clob) as jdoc
from products_target
) t
on s.product_id = t.product_id
and s.code = t.code
and json_equal(s.jdoc, t.jdoc)
where s.product_id is null or t.product_id is null
) base,
json_table(base.jdoc
columns (
code path '$.CODE.string()',
name path '$.NAME.string()',
description path '$.DESCRIPTION.string()',
style path '$.STYLE.string()',
msrp path '$.MSRP.number()'
)
) j
order by product_id, row_source

The result is now the same as the other approaches, with proper null handling and clob handling built in:

ROW_SOURCE PRODUCT_ID CODE  NAME                DESCRIPTION               STYLE        MSRP
---------- ---------- ----- ------------------- ------------------------- ------------ ----
source              1 P-ES  Everest Summit      Mt. Everest Summit        18x20          30
target              1 P-ES  Everest Summit      Mount Everest Summit      18x20          30
source              2 P-EB  Everest Basecamp    Mt. Everest Basecamp      18x20          30
target              2 P-EB  Everest Basecamp    Mount Everest Basecamp    18x20          30
source              3 P-FD  Fujiyama Dawn       Mount Fuji at dawn        11x17          20
target              3 P-FD  Fuji Dawn           Mount Fuji at dawn        11x17          19
source              4 P-FS  Fujiyama Sunset     Mount Fuji at sunset      11x17          20
target              4 P-FS  Fuji Sunset         Mount Fuji at sunset      11x17          20
source              6 PC-ES Everest Postcards   Mt. Everest postcards     5x7             9
target              6 PC-ES Everest Postcards   Mount Everest postcards   Monochrome      9
source              8 PC-K2 K2 Postcards        K2 postcards              Color           9
target              8 PC-K2 K2 Postcards        K2 postcards              <<null>>        9
source              9 PC-S  Shasta Postcards    Mount Shasta postcards    5x7             9

Full Outer Join Using JSON and SQL Macro

The JSON approach converts easily into a SQL macro function:

create or replace function row_compare_json(
  p_source in dbms_tf.table_t,
  p_target in dbms_tf.table_t,
  p_id_column in dbms_tf.columns_t
) return varchar2
sql_macro $if dbms_db_version.version >= 21 $then (table) $end
is
  l_sql varchar2(32000);
begin

  l_sql :=
q'[
select
coalesce(s.row_source, t.row_source) as row_source,
coalesce(s.##ID_COLUMN##, t.##ID_COLUMN##) as ##ID_COLUMN##,
coalesce(s.jdoc, t.jdoc) as jdoc
from
(
select
'source' as row_source
, ##ID_COLUMN##
, json_object(* returning clob) as jdoc
from p_source
) s
full outer join (
select
'target' as row_source
, ##ID_COLUMN##
, json_object(* returning clob) as jdoc
from p_target
) t
on s.##ID_COLUMN## = t.##ID_COLUMN##
and json_equal(s.jdoc, t.jdoc)
where
s.##ID_COLUMN## is null
or t.##ID_COLUMN## is null
order by ##ID_COLUMN##, row_source
]';

  l_sql := replace(l_sql, '##ID_COLUMN##', p_id_column(1));
  dbms_output.put_line(l_sql);
  return l_sql;

end row_compare_json;

The columnar output makes the differences easier to see than the entire row as flat JSON. The macro can be written to return this format by using the columns collection of the table_t parameter to construct the json_table clause. In this case the column datatypes decide which json path item method to use for each column:

create or replace function row_compare_json_alt_to_columns(
  p_source in dbms_tf.table_t,
  p_target in dbms_tf.table_t,
  p_id_column in dbms_tf.columns_t
) return varchar2
sql_macro $if dbms_db_version.version >= 21 $then (table) $end
is
  c_lf constant varchar2(1) := chr(10);
  l_sql varchar2(32000);
  l_ids varchar2(4000);
  l_joins varchar2(4000);
  l_json_object varchar2(100);
  l_json_table_columns varchar2(4000);
  l_column dbms_tf.column_metadata_t;
  l_unquoted varchar2(128);
begin

  --19c must serialize to clob, 21c can use the native binary json format
  $if dbms_db_version.version = 19 $then
  l_json_object := 'json_object(* returning clob)';
  $else
  l_json_object := 'json_object(* returning json)';
  $end

  --id columns selected alongside the json document
  for i in 1..p_id_column.count loop
    l_ids := l_ids
      || case when i > 1 then ' , ' else ', ' end
      || p_id_column(i) || c_lf;
  end loop;
  l_ids := trim(trailing c_lf from l_ids);

  --equality joins on the id columns
  for i in 1..p_id_column.count loop
    l_joins := l_joins
      || case when i > 1 then ' and ' end
      || 's.' || p_id_column(i)
      || ' = t.' || p_id_column(i) || c_lf;
  end loop;
  l_joins := trim(trailing c_lf from l_joins);

  --json_table columns with a path item method chosen by datatype
  for i in 1..p_source.column.count loop
    l_column := p_source.column(i).description;
    l_unquoted := trim(both '"' from l_column.name);
    l_json_table_columns := l_json_table_columns
      || case when i > 1 then ' , ' end
      || l_unquoted || ' path ''$.' || l_unquoted
      || case
           when l_column.type = dbms_tf.type_number then '.number()'''
           else '.string()'''
         end
      || c_lf;
  end loop;
  l_json_table_columns := trim(trailing c_lf from l_json_table_columns);

  l_sql :=
q'[
select base.row_source, j.*
from
(
select
coalesce(s.row_source, t.row_source) as row_source
, coalesce(s.jdoc, t.jdoc) as jdoc
from
(
select
'source' as row_source
##ID_COLUMNS##
, ##JSON_OBJECT## as jdoc
from p_source
) s
full outer join
(
select
'target' as row_source
##ID_COLUMNS##
, ##JSON_OBJECT## as jdoc
from p_target
) t
on ##JOIN_COLUMNS##
and json_equal(s.jdoc, t.jdoc)
where s.##ID_COLUMN_ONE## is null or t.##ID_COLUMN_ONE## is null
) base,
json_table(base.jdoc
columns (
##JSON_TABLE_COLUMNS##
)
) j
order by ##ID_COLUMN_ONE##, row_source
]';

  l_sql := replace(l_sql, '##ID_COLUMNS##', l_ids);
  l_sql := replace(l_sql, '##JOIN_COLUMNS##', l_joins);
  l_sql := replace(l_sql, '##ID_COLUMN_ONE##', p_id_column(1));
  l_sql := replace(l_sql, '##JSON_OBJECT##', l_json_object);
  l_sql := replace(l_sql, '##JSON_TABLE_COLUMNS##', l_json_table_columns);
  dbms_output.put_line(l_sql);
  return l_sql;
end row_compare_json_alt_to_columns;

And using the macro is still very simple; all of the complexity is hidden inside the macro implementation:

select *
from row_compare_json_alt_to_columns(products_source, products_target, columns(product_id, code));

Because we have full control over how the macro generates the SQL, we can also tune it after studying the resulting explain plans. In the article source code I have generated the explain plans for all of the macro approaches; from these you can see that some approaches are better than others.

Row Comparisons: Real World Cases

I have set up the basic elements for a more advanced row comparison in the source repository under examples/geodata-ext-tables. There are csv files with cities and zip codes from simplemaps and geonames, as well as external table definitions for each file. If you set up all of the external tables you can try running row comparisons on some much larger datasets.

Column Comparisons

Now that we can easily compare two row sources to determine rows with differences, the next logical step is isolating the columns with the differences.

The JSON method produces a flat JSON document for each row, where columns and their values are represented as key-value pairs. Before we move on to the next article in this series, let's look at how to dynamically parse this flat JSON into unpivoted rows that show each column and its data.

This gives us a tool for easily unpivoting the results of the JSON row comparison methods and using the unpivoted results as inputs to the row comparison methods to find the specific columns with differences. We will create this utility using a JSON dataguide to discover which JSON keys need to be unpivoted. The article will be located here: practicalplsql.org/json-dynamic-unpivot-dataguide.
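The idea in miniature (a Python sketch, not the dataguide implementation): each flat row document unpivots into one (column, value) row per key.

```python
import json

# A flat row document like the ones produced by json_object(*).
doc = '{"PRODUCT_ID":8,"CODE":"PC-K2","STYLE":null,"MSRP":9}'

# One (column_name, value) pair per key in the document.
unpivoted = list(json.loads(doc).items())
for column_name, value in unpivoted:
    print(column_name, value)
```

The real utility will do this dynamically in SQL, using the dataguide to enumerate the keys.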

The next article on table data comparison will focus on column level comparison, using unpivot to convert row differences to column differences in a few ways. As soon as it's published I will update this link: practicalplsql.org/table-data-comparison-columns.

–Anthony Harper


Source code for this article can be found at https://github.com/gaiansentience/practicalplsql
