Ed W


I apologize in advance if my explanation and questions don't use the language and conventions of programming professionals; I'm a self-taught amateur. I've used VFP 6.0 successfully for many years and written many programs that have worked well, but they certainly lack the sophistication and elegance that professionals would produce.


BACKGROUND:
6 tables are related via a common index

The tables each have different numbers of records, ranging from roughly 500,000 to 800,000, and, when the project is complete, the record counts will be roughly 50% larger.

The program I am writing about has one function: to calculate various values based on various subsets of the data within the 6 related tables. The calculations that are needed are based on numeric fields located in only 1 table, but that table and the other 5 tables have various fields that are needed to create the filters and the subsequent subsets.

The actual filter expressions that define the subsets are kept in a field in a 7th table, and each filter is "grabbed" sequentially within a SCAN/ENDSCAN loop acting on the 7th table. The filter expression is then STOREd in a character variable. The filter expressions generally reference 1 to 15 fields, though some are longer.
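
In outline, the loop looks something like this (the alias and field names here are placeholders, not my real ones):

SELECT FILTERS7                                              && placeholder alias for the 7th table
SCAN
    STORE ALLTRIM(FILTERS7.cFilterExpr) TO cFilterFromTable7 && placeholder field name
    SELECT TABLE1
    SET FILTER TO &cFilterFromTable7
    * ... the CALCULATE statements shown below run here ...
    SET FILTER TO
ENDSCAN && SCAN reselects FILTERS7 automatically on each pass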

Because there are literally several thousand filters that have to be processed, I'm trying to come up with code that will run the fastest.

Currently, after creating each subset with a SET FILTER TO &cFilterFromTable7 command, the following calculations, among others, are processed. (The other math functions within the program run very rapidly; the FoxPro CALCULATE command, when filters are used, seems to be causing what I perceive to be a slowdown.)

CALCULATE CNT(), SUM(TABLE1.FIELD1), SUM(TABLE1.FIELD2), ;
    SUM(TABLE1.FIELD3), SUM(TABLE1.FIELD4) ;
    TO nTotalRecords, nField1sum, nField2sum, nField3sum, nField4sum

CALCULATE CNT() TO nField1_positive FOR TABLE1.FIELD1 > 0
CALCULATE CNT() TO nField2_positive FOR TABLE1.FIELD2 > 0
CALCULATE CNT() TO nField3_positive FOR TABLE1.FIELD3 > 0

Because the tables are so large and filters inherently slow down the CALCULATE command, I'm trying to create code that runs through the tables the fewest number of times, but I haven't thought of a way to combine the 3 statements immediately above that provide counts of the positive values of the 3 fields.

When FIELD1 is positive, FIELD2 and FIELD3 are also positive only about 99.5% of the time.
When FIELD2 is positive, FIELD3 is positive only about 99.5% of the time.

QUESTIONS:
1. Is there a way to combine those 3 statements somehow, or is there some other, better way to accomplish the goal of getting those counts?

2. I understand that filters slow down the speed of the CALCULATE command considerably. Is using filters (as I currently am) better and/or faster than using a FOR expression with the CALCULATE commands?

3. Are there any tricks regarding the order of conditions within a filter, or the use of indexes, that can increase speed?

4. The typical filter I am working with creates a subset that is usually well under 10,000 records, often under 1,000 records, and rarely over 200,000 records. Is there some threshold for the size of the subset that makes a difference in how best to proceed? Would I be better off having 2 procedures, one for under the threshold and one for over it?

5. What other considerations that I don't know to ask about have to be taken into account? (I'm working with as fast a machine as I can afford, and it is reasonably fast by today's standards. It is running XP Pro SP2, which is updated regularly. There are 75 GB of free space on a 7200 RPM hard drive. I've got 2 GB of RAM, and I've turned off everything that might be working in the background, including anti-virus software, that might use resources and interfere with the speed of calculations. The machine is not connected to a network or the internet. Because of some rather dramatic memory-related crashes when I first started testing, I think I must run the program with the following:

SYS(3050, 1, 512*1024*1024) and SYS(3050, 2, 512*1024*1024)

because it tends to crash periodically without those statements and hasn't yet crashed with them.)

6. Is there any advantage to upgrading to a newer version of VFP?




Re: Speed with CALCULATE

Naomi Nosonovsky


Hi Ed,

Sorry if my reply is too quick and I haven't read the whole long description of the problem.

I would probably approach this problem using SELECT-SQL with a WHERE condition instead of a filter and CALCULATE. The difference is that SELECT-SQL works with the data saved on disk, while CALCULATE and similar commands work with the current table (if it's buffered, that makes a difference).

So, your statement can be re-written with SELECT-SQL (or with the SUM command, if you want) in this fashion:

select sum(iif(Field1>0,1,0)) as CntField1, sum(iif(Field2>0,1,0)) as CntField2, etc. from myTable ;
where &lcFilterCond into (either cursor or array, whichever you prefer)
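
Spelled out for all of the fields and counts from the original post, in a single pass, it might look something like this (the cursor name is just an example):

SELECT COUNT(*) AS nTotalRecords, ;
       SUM(Field1) AS nField1Sum, SUM(Field2) AS nField2Sum, ;
       SUM(Field3) AS nField3Sum, SUM(Field4) AS nField4Sum, ;
       SUM(IIF(Field1 > 0, 1, 0)) AS nField1Positive, ;
       SUM(IIF(Field2 > 0, 1, 0)) AS nField2Positive, ;
       SUM(IIF(Field3 > 0, 1, 0)) AS nField3Positive ;
    FROM Table1 ;
    WHERE &lcFilterCond ;
    INTO CURSOR curTotals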






Re: Speed with CALCULATE

Tamar E. Granor

First, as Naomi said, SQL-SELECT may give you better results. I don't fully understand what you're doing with the stored filters, but generally, SQL-SELECT is a good way to combine data from a number of tables quickly.

<< 2. I understand that filters slow down the speed of the CALCULATE command considerably. Is using filters (as I currently am) better and/or faster than using a FOR expression with the CALCULATE commands? >>


Filters and FOR have the same performance profile. See the answer to your next question.

<< 3. Are there any tricks regarding the order of conditions within a filter, or the use of indexes, that can increase speed? >>


The secret to getting good performance in VFP is having the right indexes and creating the right conditions. VFP uses index tags (CDX, not IDX) to optimize queries, filters, and FOR clauses.


For VFP to do its thing, though, your conditions have to exactly match the index tags' key expressions. For example, if you have a tag on UPPER(Company), this filter is optimizable:


SET FILTER TO UPPER(Company)="SAFE"


but this one is not:


SET FILTER TO Company="SAFE"


If a filter contains multiple conditions, each one can be optimized separately, and VFP will optimize as many as possible.
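
A minimal sketch of the idea (the table, field, and tag names here are just illustrative; SYS(3054) reports the Rushmore level for queries):

USE Customer
INDEX ON UPPER(Company) TAG UpComp     && key expression: UPPER(Company)
SET FILTER TO UPPER(Company) = "SAFE"  && matches the key expression, so Rushmore can optimize it

=SYS(3054, 1)                          && show Rushmore optimization info for queries
SELECT * FROM Customer WHERE UPPER(Company) = "SAFE" INTO CURSOR curTest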


<< 4. The typical filter I am working with creates a subset that is usually well under 10,000 records, often under 1,000 records, and rarely over 200,000 records. Is there some threshold for the size of the subset that makes a difference in how best to proceed? Would I be better off having 2 procedures, one for under the threshold and one for over it? >>

None of those sizes is a big deal for VFP, but a query (SQL-SELECT) that lets you deal with the subset outside the context of the whole table will probably be faster than SET FILTER or a FOR clause.
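
In other words, something along these lines (the cursor and variable names are arbitrary):

SELECT Field1, Field2, Field3, Field4 ;
    FROM Table1 ;
    WHERE &cFilterFromTable7 ;
    INTO CURSOR curSubset
* the math now runs against the small cursor, with no filter in the way
CALCULATE CNT(), SUM(Field1), SUM(Field2), SUM(Field3), SUM(Field4) ;
    TO nTotalRecords, nField1Sum, nField2Sum, nField3Sum, nField4Sum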

<< 6. Is there any advantage to upgrading to a newer version of VFP? >>

There are lots of advantages to using VFP 9 rather than VFP 6, but I don't think there are any related to optimization.

Tamar





Re: Speed with CALCULATE

Naomi Nosonovsky

Tamar,

Your last point was about optimization in VFP9 vs. VFP6. How about the new BINARY indexes?





Re: Speed with CALCULATE

CetinBasoz

You have an interesting problem domain, and I think it's related to scientific calculations that would take serious processing time.

If we go with the last question first:

"Is there any advantage upgrading ..."

Well, this is very hard to determine. I wouldn't blindly say yes. The data engine has changed twice since VFP7 (7 is the same as 6). There are those reporting slower performance, and I myself experience better performance, so who is right I don't know in the first place. Probably due to factors outside VFP we're getting different results. If this question weren't specifically about your case, I would say upgrading has a lot of advantages. You're using filters, and VFP9 is the version that made me look like a liar in front of my Fox class students - I tried to show them how slowly a filter performs, and I was surprised myself that it performed well. :) Before, I'd have said absolutely keep away from filters, but since VFP9 I have doubts. Still, I always think there are better ways than using a filter.

A reason to upgrade might be the ability to use binary indexes. Indexing on a logical expression in other versions is not a good idea, but in VFP9 binary indexes make that a good option.
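
For example (VFP 9 syntax; the field and tag names are illustrative):

USE Table1
* VFP 9 only: a binary index on a logical expression is tiny and fast
INDEX ON lField6 TAG field6 BINARY
INDEX ON DELETED() TAG isdel BINARY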

Now in order:

1) Yes, you could combine them into a single one:

calculate sum( iif( field1 > 0, 1, 0 ) ), sum( iif( field2 > 0, 1, 0 ) ), sum( iif( field3 > 0, 1, 0 ) ) to array aResult

I can't say for sure it'd be faster. In my tests on 1 million records, on a 1 GB RAM Turion 64 X2 notebook, the timings were like:

0.930 vs 0.875 seconds (average). It is faster by a slight margin, and that margin might be meaningful if done N thousands of times.

2) Well, filters. :) Many experts have them on their "never use" command list. I, too, always hated them and kept away. In VFP9, however, as I said before, I have doubts. But wait until after the questions. :)

3) Tricks about order. Yes, there are, and not just for filters but for almost anything that needs to return a logical result in VFP. It's commonly known as 'shortcutting' (short-circuit evaluation). It goes like this:

condition1 and condition2 and ....

When "and" is used VFP knows that as soon as it encounters a FALSE condition the whole result would be FALSE and doesn't check others.

condition1 or condition2 or ....

When "or" is used likewise as soon as a TRUE condition encountered it's unnecessary to check rest and VFP shortcuts.

So to benefit from this, you put the conditions that are expected to have higher rates of being FALSE at the front with "and" (or TRUE with "or").
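
For instance (a sketch; suppose cRegion = "NW" is rarely true, while nAmount > 0 is almost always true):

* good: the condition that is usually FALSE comes first, so most
* records are rejected after a single test
SET FILTER TO cRegion = "NW" AND nAmount > 0

* worse: nAmount > 0 is tested (and is TRUE) on nearly every record
* before cRegion gets a chance to reject it
SET FILTER TO nAmount > 0 AND cRegion = "NW"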

4, 5 and the rest)

Well, since you say there's no connection to a network etc., I assume this is a perfect case for EXCLUSIVE use of local data. Exclusive access and local data are both important in mass operations like yours. I don't know of a specific threshold.

Now let's look into it in general.

- CALCULATE, COUNT, SUM ... xBase commands vs. SQL. Many would think to do it with SQL, but the xBase approaches are generally faster by a factor of 70-100%. I could show it with a benchmark, but in turn someone could show SQL is faster with another piece of benchmarking code. So try different variations yourself.

- Both SQL and xBase operate faster on data that's already physically ordered in the most-used order(s).

- Both SQL and xBase operate faster on data that's physically smaller.

How could you use these to your benefit? Let's say, instead of SET FILTER, you get:

- The data that matches your filter

- Only the needed fields, plus some precalculated values used in the calculation (i.e., if you get iif(field1 > 0, 1, 0) as a field, it'd be compact and you'd do a simple sum())

- The rows ordered the way your FOR expressions use them most

The resulting data would be smaller, ordered, and processed faster.
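
Put together, that might look something like this (all names are illustrative):

* pull a compact, pre-calculated working set in one pass
SELECT cIndexedField5, nField1, nField2, nField3, nField4, ;
       IIF(nField1 > 0, 1, 0) AS nPos1, ;
       IIF(nField2 > 0, 1, 0) AS nPos2, ;
       IIF(nField3 > 0, 1, 0) AS nPos3 ;
    FROM Table1 ;
    WHERE &lcFilterCond ;
    ORDER BY cIndexedField5 ;
    INTO CURSOR curWork
* the counts are now plain sums over a small cursor
CALCULATE CNT(), SUM(nPos1), SUM(nPos2), SUM(nPos3) ;
    TO nTotal, nField1Positive, nField2Positive, nField3Positive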

- Another interesting approach might seem weird but be faster:

Instead of N thousands of filters + a series of calculations:

scan && the table where the calculation is needed
    for each filter expression
        if the record matches the filter
            increment cnt[filterIndex, calcIndex] as appropriate
        endif
    endfor
endscan

The above approach might be extremely optimizable. For example, you're checking the count of Field1 > 0 for each and every filter. If I got it right, a record might skip N thousands of filter checks as long as it has a value <= 0.
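
A minimal sketch of that single-pass idea (all table and field names are placeholders):

* load the filter expressions once
SELECT Table7
COUNT TO nFilters
DIMENSION aFilters[nFilters], aCnt[nFilters, 4]
aCnt = 0
nI = 0
SCAN
    nI = nI + 1
    aFilters[nI] = ALLTRIM(Table7.cFilterExpr)   && placeholder field name
ENDSCAN

* one pass over the big table, testing every filter against each record
SELECT Table1
SCAN
    FOR nF = 1 TO nFilters
        IF EVALUATE(aFilters[nF])
            aCnt[nF, 1] = aCnt[nF, 1] + 1                        && subset record count
            aCnt[nF, 2] = aCnt[nF, 2] + IIF(nField1 > 0, 1, 0)   && positive counts
            aCnt[nF, 3] = aCnt[nF, 3] + IIF(nField2 > 0, 1, 0)
            aCnt[nF, 4] = aCnt[nF, 4] + IIF(nField3 > 0, 1, 0)
        ENDIF
    ENDFOR
ENDSCAN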

Such calculations sometimes call for other solutions:

(the importance of the gain in speed would make these options worth researching)

-Use external help:

Excel. It knows how to calculate. :) It's unlikely the whole operation would be faster, but it might be worth a try.

SQL Server Express. VFP is fantastic at data operations, but SQL Server 2005 might well be faster (it has surprised me numerous times). You might load the data there from within VFP and do the calculations there. Its execution plans and such could help with optimizing.

Other languages like Fortran, C, or C#.NET in conjunction with VFP.

And since you said you have a fast machine, it might also have a multicore CPU. Then you could divide the process onto multiple CPUs via multithreading, or via multiple ShellExecute'd modules specialized in doing specific calculations and/or working with different filtered sets. Dividing the work onto multiple machines is yet another thing to think of.

Your filters may be things that could define groupings for single-pass SQL queries.

One filter might fully cover another, so that at times you can work with a subset of a subset of a subset ...

I didn't talk about indexes at all. They are needed for performance, but too many of them, collations other than MACHINE, tags that don't make sense, etc. might work against you. There are guidelines, but unfortunately no single, solid rule.

SYS() functions. You may also need to use them, but in my experience I have rarely needed to play with the defaults.





Re: Speed with CALCULATE

Ed W

To Naomi Nosonovsky, Tamar E. Granor, and CetinBasoz:

To say that I greatly appreciate your time, insights, and expertise is a gross understatement.

I will be out of town for a couple of days but will work hard next week to digest and implement what you've graciously shared.

Thank you.

Ed W





Re: Speed with CALCULATE

Tamar E. Granor

Good point. If his filters include binary conditions, then binary indexes might speed things up.

Tamar




Re: Speed with CALCULATE

Ed W

Using SELECT-SQL and making more efficient use of the CALCULATE command has proven to run at least 2.5 times faster. Having to think through the trial-and-error process of getting the query statement correct also triggered some other thoughts on how to better organize and deal with the filters I'm using, which will further speed up what I'm trying to accomplish. I'm guessing the processing time will drop to about 10% - 15% of the original code, for an 85% - 90% savings of time. Dramatic. Thanks again for the ideas.

That said, I've come across a problem that I did not anticipate. I'm getting different numbers of subset records when comparing the results from my original SET FILTER TO approach and the recommended/far more effective SELECT-SQL approach, with the SELECT-SQL approach always having more records in each subset. When testing without a WHERE clause, the SELECT-SQL query actually produces more records in the cursor than any of the tables from which it is drawn - about 80,000 (about 11%) more when compared against the table with the most records.

Though I did a fair amount of research and modeled my query on examples found in the MSDN Library and on the internet, I'm sure it must be wrong in some respect. Here's the query I created:

SELECT a.nFIELD1 AS Field1, a.nFIELD2 AS Field2, a.nFIELD3 AS Field3, ;
       a.nFIELD4 AS Field4 ;
    FROM TABLE1 AS a ;
    INNER JOIN TABLE2 AS b ON a.cIndexedFIELD5 = b.cIndexedFIELD5 ;
    INNER JOIN TABLE3 AS c ON a.cIndexedFIELD5 = c.cIndexedFIELD5 ;
    INNER JOIN TABLE4 AS d ON a.cIndexedFIELD5 = d.cIndexedFIELD5 ;
    INNER JOIN TABLE5 AS e ON a.cIndexedFIELD5 = e.cIndexedFIELD5 ;
    INNER JOIN TABLE6 AS f ON a.cIndexedFIELD5 = f.cIndexedFIELD5 ;
    WHERE NOT a.lFIELD6 AND NOT a.lFIELD7 AND ;
          a.cFIELD8 NOT IN ("cVALUE1","cVALUE2","cVALUE3") AND ;
          NOT (a.cFIELD8 == "cVALUE5" AND a.nFIELD9 = 30) ;
    INTO CURSOR cursor_name

(The filter in the WHERE clause is just an example I've included in case the filters matter in some way to this discussion.)

I've tried placing all 7 tables in the FROM clause, but that produced an error message saying that cIndexedFIELD5 couldn't be found.

Perhaps I'm using the wrong join expression? I'm trying to create a query that will produce results identical to what the following would produce:

SELECT TABLE1
SET ORDER TO cIndexedFIELD5
SET RELATION TO TABLE1.cIndexedFIELD5 INTO b ADDITIVE
SET RELATION TO TABLE1.cIndexedFIELD5 INTO c ADDITIVE
SET RELATION TO TABLE1.cIndexedFIELD5 INTO d ADDITIVE
SET RELATION TO TABLE1.cIndexedFIELD5 INTO e ADDITIVE
SET RELATION TO TABLE1.cIndexedFIELD5 INTO f ADDITIVE
SET RELATION TO TABLE1.cIndexedFIELD5 INTO g ADDITIVE
(Note: b - g are ordered on cIndexedFIELD5)

Could someone please set me straight?





Re: Speed with CALCULATE

Naomi Nosonovsky

Your query looks OK to me at a quick glance.

However, you should understand the following difference.

If you have TableA with N records and TableB with M records, where each record in TableA may have multiple records in TableB, the inner join of TableA with TableB will produce M records (M > N). In the case of a multiple-table join, it's a bit more complex to find the total number of records.

Here is a simple test; you may add more tables to it to see the results:

Code Snippet

CREATE CURSOR curParent (nID I)
INDEX ON nID TAG nID
CREATE CURSOR curChild (nID I, nParentID I)
INDEX ON nID TAG nID
INDEX ON nParentID TAG nParentID

* 100 children, 10 parents - every parent ends up with 10 children
FOR i = 1 TO 100
    INSERT INTO curChild (nID) VALUES (m.i)
    IF i % 10 = 0
        INSERT INTO curParent VALUES (m.i)
        REPLACE nParentID WITH m.i FOR EMPTY(nParentID) IN curChild
    ENDIF
NEXT

* the join returns one row per CHILD, i.e. 100 rows, not 10
SELECT curChild.nID FROM curParent ;
    INNER JOIN curChild ON curParent.nID = curChild.nParentID ;
    INTO CURSOR curMix
=MESSAGEBOX(TRANSFORM(RECCOUNT('curMix')))   && TRANSFORM() so this also runs in VFP 6





Re: Speed with CALCULATE

CetinBasoz

With an inner join, as Naomi pointed out, you might be selecting more rows than intended. Your SQL looks like you only want rows from table1 which have entries in the other tables. You might rewrite it like:

Code Snippet

select * from table1 ;
    where !lField6 and !lField7 and !(cField8 == 'cValue5' and nField9 = 30) and ;
          !(cField8 == 'cValue1') and ;
          !(cField8 == 'cValue2') and ;
          !(cField8 == 'cValue3') and ;
          seek( table1.cIndexedField5, 'table2', 'cIndexedField5Tag' ) and ;
          seek( table1.cIndexedField5, 'table3', 'cIndexedField5Tag' ) and ;
          seek( table1.cIndexedField5, 'table4', 'cIndexedField5Tag' ) and ;
          seek( table1.cIndexedField5, 'table5', 'cIndexedField5Tag' ) and ;
          seek( table1.cIndexedField5, 'table6', 'cIndexedField5Tag' ) and ;
          seek( table1.cIndexedField5, 'table7', 'cIndexedField5Tag' ) ;
    into cursor cursor_name

PS: For table aliases in SQL, prefer aliases of at least 2 letters; the single letters A-J are reserved for VFP's work areas, which can lead to surprises.





Re: Speed with CALCULATE

Ed W

CetinBasoz and Naomi,

Thank you for the prompt replies. I'll try each of your suggestions over the next day or so.

Ed W





Re: Speed with CALCULATE

Ed W

In response to CetinBasoz's generous suggestion, here's what I've tried, along with the results. I'm going on the assumption that I'm making some sort of obvious error that's staring me in the face but that I'm unable to see.

On the chance that it matters but was mentioned so far back that it's been lost in the shuffle: I'm using VFP 6.0.

Attempt 1. This is what I typed (I didn't copy and paste), intending to duplicate his code:

SELECT * FROM table1 ;
WHERE !lField6 AND !lField7 AND !(cField8 == "cValue5" AND nField9 = 30) AND ;
!(cField8 == "cValue1") AND ;
!(cField8 == "cValue2") AND ;
!(cField8 == "cValue3") AND ;
seek(table1.cIndexedField5,'table2','cIndexedField5Tag') AND ;
seek(table1.cIndexedField5,'table3','cIndexedField5Tag') AND ;
seek(table1.cIndexedField5,'table4','cIndexedField5Tag') AND ;
seek(table1.cIndexedField5,'table5','cIndexedField5Tag') AND ;
seek(table1.cIndexedField5,'table6','cIndexedField5Tag') AND ;
seek(table1.cIndexedField5,'table7','cIndexedField5Tag') ;
INTO CURSOR cursor_name

* RESULT - Error Message: Alias not found
* COMMENT - On the chance that it matters, my actual field and its index tag have the same name


Attempt 2.

SELECT * FROM table1 AS aa ;
    WHERE !aa.Field6 AND !Field7 AND ;
          !(aa.cField8 == "cValue5" AND aa.nField9 = 30) AND ;
          !(aa.cField8 == "cValue1") AND ;
          !(aa.cField8 == "cValue2") AND ;
          !(aa.cField8 == "cValue3") AND ;
          seek(aa.cIndexedField5,'table2','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table3','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table4','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table5','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table6','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table7','cIndexedField5Tag') ;
    INTO CURSOR cursor_name

* RESULT - Error Message: Alias not found


Attempt 3.

SELECT aa.nField1 AS Field1, aa.nField2 AS Field2, aa.nField3 AS Field3, ;
       aa.nField4 AS Field4 ;
    FROM table1 AS aa ;
    WHERE NOT aa.Field6 AND NOT aa.Field7 AND ;
          aa.cField8 NOT IN ("cValue1","cValue2","cValue3") AND ;
          NOT (aa.cField8 == "cValue5" AND aa.nField9 = 30) AND ;
          seek(aa.cIndexedField5,'table2','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table3','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table4','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table5','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table6','cIndexedField5Tag') AND ;
          seek(aa.cIndexedField5,'table7','cIndexedField5Tag') ;
    INTO CURSOR cursor_name

* RESULT - Error Message: Alias not found


Attempt 4.

SELECT table1.nField1 AS Field1, table1.nField2 AS Field2, ;
       table1.nField3 AS Field3, table1.nField4 AS Field4 ;
    FROM table1 AS aa ;
    WHERE NOT table1.Field6 AND NOT table1.Field7 AND ;
          table1.cField8 NOT IN ("cValue1","cValue2","cValue3") AND ;
          NOT (table1.cField8 == "cValue5" AND aa.nField9 = 30) AND ;
          seek(table1.cIndexedField5,'table2','cIndexedField5Tag') AND ;
          seek(table1.cIndexedField5,'table3','cIndexedField5Tag') AND ;
          seek(table1.cIndexedField5,'table4','cIndexedField5Tag') AND ;
          seek(table1.cIndexedField5,'table5','cIndexedField5Tag') AND ;
          seek(table1.cIndexedField5,'table6','cIndexedField5Tag') AND ;
          seek(table1.cIndexedField5,'table7','cIndexedField5Tag') ;
    INTO CURSOR cursor_name


* RESULT - Error Message: SQL: Column 'nField1' is not found


I tried a few other minor variations, all with similar results. For example, in Attempt 4 immediately above, changing the FROM clause to either FROM table1 or FROM table1 AS table1 both produce the 'Alias not found' error message.

As indicated in my earlier post (31 Jul 2007, 1:35 UTC), the SELECT-SQL code that I've come up with runs, but it does not duplicate the results I get with the SET RELATION TO command. My goal is to find and use a SELECT-SQL statement that will duplicate the results I get when I use SET RELATION TO. I believe CetinBasoz's code is meant to achieve that goal, but I'm obviously not properly interpreting/executing his suggestion.

Any thoughts?





Re: Speed with CALCULATE

Naomi Nosonovsky

Hi Ed,

I quickly tried something similar and achieved the desired result after I opened the table before running the select statement.

Here is my quick test:

SELECT * from Project WHERE SEEK(Ky,'Program','Ky') && Note that I don't prefix Ky with the table name here.

I think you may want to start with a simple select with one seek, and add more and more complexity gradually.

If your tables are huge, you may want to create a test version of your database to test these queries.

In other words, try to

1. Open the tables involved in SEEK prior to running the code.

2. Add seek conditions one by one until something breaks.

Hopefully this helps.





Re: Speed with CALCULATE

CetinBasoz

Hi Ed,

SQL opens the table(s) if they're not already open, but xBase commands and functions do not (I used seek() there, which assumes the tables are already open). For it to work, you need to open your tables first:

use table1 in 0
use table2 in 0
use table3 in 0
*...
use table7 in 0

Also, I assumed the tag on "cIndexedField5" was named "cIndexedField5Tag". That name is symbolic; actual tag names are at most 10 characters long.

The seek() approach is specific to VFP tables; it's a replacement for the EXISTS clause in SQL. With EXISTS it might instead look like:

select * from table1 ;
    where exists (select * from table2 where table1.cIndexedField5 == table2.cIndexedField5)

instead of:

select * from table1 where seek(table1.cIndexedField5,"table2","cIndexedField5Tag")

In effect it says: include records where there is a match in table2 (using cIndexedField5 as the lookup field on both sides - the relating field).





Re: Speed with CALCULATE

Tamar E. Granor

I confess that I haven't read all the messages in this thread in detail, but I can't think of any situation where I'd use SEEK in a SELECT statement. Instead, use a subquery:

For example, instead of:

seek(table1.cIndexedField5,'table2','cIndexedField5Tag')

I'd use:

WHERE EXISTS ;
(SELECT <key expression for cIndexedField5 tag of Table2> ;
FROM Table2 ;
WHERE <same expr> = Table1.cIndexedField5)

and so forth.
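
With the names used earlier in this thread, the first of those would come out something like this (assuming the tag on Table2 is keyed on cIndexedField5 itself):

SELECT * FROM Table1 ;
    WHERE EXISTS ;
        (SELECT cIndexedField5 FROM Table2 ;
         WHERE Table2.cIndexedField5 = Table1.cIndexedField5) ;
    INTO CURSOR cursor_name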

Tamar