In performance tuning, there is no set path. You often have to burn your hands trying different routes before settling on the most optimized one. The other day we were tuning some old code that loads source data from a file into a temp table and then into the final tables. The first part of the process was handled by SQL*Loader and was already fast, so we did not explore the external-tables option. Instead, we decided to focus on the second part: loading from the staging table into the final tables.
Each record in the staging tables was at the lowest level, say child, but included both parent and grandparent fields. So records looked something like this:
111----Grand Parent Fields---222|555---Parent Fields---666|ABC---Child Fields---DEF
111----Grand Parent Fields---222|555---Parent Fields---666|GHI---Child Fields---JKL
111----Grand Parent Fields---222|777---Parent Fields---888|MNO---Child Fields---PQR
333----Grand Parent Fields---444|999---Parent Fields---000|STU---Child Fields---XYZ
Each child record was unique; a parent record could repeat for a certain number of children, and similarly a grandparent could repeat for a certain number of parent records.
The existing code was something like this:
FOR ln_loop IN (SELECT all_records FROM staging_table)
LOOP
   -- Get grandparent-level information
   SELECT fields
     INTO local variables
     FROM table
    WHERE field1 = ln_loop.field1
      AND field2 = ln_loop.field2;
   -- Get some parent-level information
   -- Get some child-level information
   -- INSERT/UPDATE final tables one by one
END LOOP;
The first approach was to change from row-by-row to set-based processing using BULK COLLECT and FORALL. However, since our code had to fetch information at three different levels and insert into multiple tables, we had to create one more staging table to hold the final data. This reduced the time from X minutes for Y records to X/2 minutes for the same number of records. The problem with this approach was that it required too many changes, which might have lengthened our testing cycle.
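As a rough sketch, the set-based version looked something like this (the table names and the LIMIT value here are placeholders, not the actual code):

```sql
DECLARE
  TYPE t_stage_tab IS TABLE OF staging_table%ROWTYPE;
  l_rows   t_stage_tab;
  CURSOR c_stage IS SELECT * FROM staging_table;
BEGIN
  OPEN c_stage;
  LOOP
    -- Fetch a batch of rows in one round trip instead of one at a time
    FETCH c_stage BULK COLLECT INTO l_rows LIMIT 1000;
    EXIT WHEN l_rows.COUNT = 0;
    -- Bind the whole batch into a single INSERT
    FORALL i IN 1 .. l_rows.COUNT
      INSERT INTO final_staging_table VALUES l_rows(i);
  END LOOP;
  CLOSE c_stage;
  COMMIT;
END;
/
```

In the real code each level's lookup would be bulk-fetched similarly; the point is that the per-row context switches between PL/SQL and SQL are replaced by one round trip per batch.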
The second approach was to introduce a hash map. We created an associative array whose index was the grandparent-level identifier and stored our looked-up values in it. When another record belonging to the same grandparent comes along, we first check whether the value already exists in the array; if not, we compute it and store it, otherwise we simply take it from the array. So if there are 10 distinct grandparent (GP) records, each GP has 10 parent (P) records, and each P has 10 child (C) records, there are 1,000 unique child-level records in total. Earlier we were hitting the database 1,000 times each for the GP-, P- and C-level queries. With this approach we reduced that to 10 hits for the GP query, 100 for the P query, and 1,000 for the C query. The overall time dropped to X/2.5 for Y records, and the biggest advantage was the smaller code change: we only had to add IF..ELSE and TYPE-handling statements.
The pseudo-code is as follows:
TYPE lr_record IS RECORD (field1 NUMBER,
                          field2 NUMBER,
                          ...
                          fieldN NUMBER);
TYPE lt_hash_array IS TABLE OF lr_record INDEX BY VARCHAR2(20);
la_hash_array lt_hash_array;

FOR ln_loop IN (SELECT all_records FROM staging_table)
LOOP
   IF la_hash_array.EXISTS(ln_loop.field1 || '-' || ln_loop.field2) THEN
      -- Cache hit: take the values from the array
      NULL;
   ELSE
      SELECT fields
        INTO local variables
        FROM table
       WHERE field1 = ln_loop.field1
         AND field2 = ln_loop.field2;
      -- Cache miss: store the fetched values in the array
      -- (assignment uses la_hash_array(key), not .EXISTS, which only tests membership)
      la_hash_array(ln_loop.field1 || '-' || ln_loop.field2).field1 := local_var1;
      la_hash_array(ln_loop.field1 || '-' || ln_loop.field2).field2 := local_var2;
      ...
      la_hash_array(ln_loop.field1 || '-' || ln_loop.field2).fieldN := local_varN;
   END IF;
END LOOP;
Conclusion: Like I said, there is no fixed rule. In this case the hash map wins over the BULK method, especially with regard to fewer changes and less work for comparable performance. However, BULK is still my favorite whenever I am loading data from one system to another.
Thanks for this article on the PL/SQL hash map.
What does this have to do with hashing?
Associative arrays do not actually use hashing. Oracle describes them as the equivalent of, e.g., a Java HashMap, but they are not:
the keys are *sorted*, so basically all interactions with the table require O(log n) runtime.
A true hash map would require O(1) runtime. For insanely huge tables that might actually make a difference.
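The sorted-key behavior this comment describes is easy to observe: walking an associative array with FIRST/NEXT returns the keys in sorted order, regardless of insertion order. A minimal sketch:

```sql
DECLARE
  TYPE t_map IS TABLE OF NUMBER INDEX BY VARCHAR2(20);
  l_map t_map;
  l_key VARCHAR2(20);
BEGIN
  -- Insert keys deliberately out of order
  l_map('zebra') := 1;
  l_map('apple') := 2;
  l_map('mango') := 3;
  -- FIRST/NEXT walk the keys in sorted order: apple, mango, zebra
  l_key := l_map.FIRST;
  WHILE l_key IS NOT NULL LOOP
    DBMS_OUTPUT.PUT_LINE(l_key || ' = ' || l_map(l_key));
    l_key := l_map.NEXT(l_key);
  END LOOP;
END;
/
```

For the caching pattern in this article the distinction rarely matters, since the O(log n) lookup is still far cheaper than a round trip to the SQL engine.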