Andy Fordyce


I have been attempting to optimise some T-SQL which runs in 20 minutes. I noticed that auto stats was set to false on the databases where the tables live, so I changed this option to true, re-ran the query, and it finished within 19 seconds. I have since imported additional data into the TestDB..Data table and query performance has reverted back to the 20 minute runtime. I have updated the statistics and indexes, but it is still the same. Is there any reason for this type of behaviour? I have the execution plans for the fast query and the slow one and they are different. Here is the query:

SELECT D.ProductId,
       D.FactId,
       D.TimeId,
       D.Value
FROM (((TestDB..Data D JOIN
        StagingDB..GMVElements_102_0 P
            ON P.ID = D.ProductId) JOIN
       StagingDB..GMVElements_102_1 R
            ON R.ID = D.RegionId) JOIN
      StagingDB..GMVElements_102_2 T
            ON T.ID = D.TimeId) JOIN
     StagingDB..GMVElements_102_3 F
            ON F.ID = D.FactId
WHERE R.Batch = 17732
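
For reference, the auto-stats change and the statistics update were along these lines (a rough sketch, not the exact commands I ran):

-- Turn automatic statistics updates back on for the database
ALTER DATABASE TestDB SET AUTO_UPDATE_STATISTICS ON

USE TestDB
-- Refresh statistics on the large table after the data import
UPDATE STATISTICS [Data] WITH FULLSCAN
-- Or refresh statistics for every table in the database
EXEC sp_updatestats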





Re: Inner join performance

DanR1


Hi,

I am not sure this will solve your problem, but I am curious as to why you have used parentheses to make your JOINs appear to be "nested" when, in fact, each join is simply between "D" and an individual table. Maybe the parentheses are interfering with the query plan.

I presume you have indexes on the "ID" columns of P, R, T, and F, and also on the Batch column of "R".

You might try putting the "R.Batch" requirement in the JOIN between D and R.
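
Something like this, for example (untested, just to illustrate the idea):

SELECT D.ProductId,
       D.FactId,
       D.TimeId,
       D.Value
FROM TestDB..Data D
JOIN StagingDB..GMVElements_102_1 R
    ON R.ID = D.RegionId
   AND R.Batch = 17732          -- filter moved into the JOIN with R
JOIN StagingDB..GMVElements_102_0 P
    ON P.ID = D.ProductId
JOIN StagingDB..GMVElements_102_2 T
    ON T.ID = D.TimeId
JOIN StagingDB..GMVElements_102_3 F
    ON F.ID = D.FactId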

I'm just guessing.

Dan

P.S. It would be helpful to learn what indexes you DO have on these tables, especially the Data table, D.






Re: Inner join performance

Ennor

Andy Fordyce wrote:
I have the execution plans for the fast query and the slow one and they are different.
Could you post them here? I think they may contain a great deal of the solution.





Re: Inner join performance

Andy Fordyce

You can get both execution plans from here:

http://195.224.33.135/Execution_Plans.zip






Re: Inner join performance

DanR1

It certainly appears that the plan notices your parentheses, with its "Nested Loops" entries.

I would recommend that you remove all that "nesting". It may well be causing a cross-join to be created by all the tables in the parentheses other than "D".

Dan





Re: Inner join performance

Andy Fordyce

The query now looks like this after removing the nesting as suggested, but it is still using the slower execution plan:

SELECT D.ProductId,
       D.FactId,
       D.TimeId,
       D.Value
FROM TestDB..Data D JOIN
     StagingDB..GMVElements_102_0 P
         ON P.ID = D.ProductId JOIN
     StagingDB..GMVElements_102_1 R
         ON R.ID = D.RegionId JOIN
     StagingDB..GMVElements_102_2 T
         ON T.ID = D.TimeId JOIN
     StagingDB..GMVElements_102_3 F
         ON F.ID = D.FactId
WHERE R.Batch = 17732






Re: Inner join performance

DanR1

I don't know what else to suggest that would encourage the faster execution plan.

I must admit to being a bit confused about your query: you select only columns from table D, and you only have table "R" in your WHERE clause. Are the entries in D occasionally lacking matching entries in P, T, and F, so that you need to JOIN on those tables to prevent the listing of entries that fail to match them? (Or did you just abbreviate the query, omitting some data elements that come from these other tables?)

Did you say you have separate indexes on "D" for ProductID, RegionID, TimeID, and FactID?

Do you have an index on "R" for which the first column is Batch?

Dan





Re: Inner join performance

Andy Fordyce

This query is part of an application which I need to optimise as much as possible. I am very limited in what changes I can make. What I don't understand is that the D table has a clustered index, and before running small updates and deletes against this table the query was very quick. I have managed to get back to the fast performance by adding a non-clustered index, but I wanted to avoid this as the table is very large and the index takes up additional storage.




Re: Inner join performance

DanR1

Andy,

As you know, you can have only one clustered index on the table, but you can have plenty of non-clustered indexes on the table.

If you want this query to run quickly, I would suggest individual non-clustered indexes on the individual columns I mentioned earlier.
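
For example, something along these lines (the index names are just placeholders, and I am assuming the dbo schema):

USE TestDB

CREATE NONCLUSTERED INDEX IX_Data_ProductId ON dbo.Data (ProductId)
CREATE NONCLUSTERED INDEX IX_Data_RegionId  ON dbo.Data (RegionId)
CREATE NONCLUSTERED INDEX IX_Data_TimeId    ON dbo.Data (TimeId)
CREATE NONCLUSTERED INDEX IX_Data_FactId    ON dbo.Data (FactId)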

From what I understand about indexes, having a clustered index really slows down any updates (that involve columns in the clustered index) and inserts/deletes to the table. Especially with a large table, inserting a single row into the "middle of the index" requires the movement of huge amounts of data. Alternatively, you can just let your data reside in a "heap," and rely on your non-clustered indexes to navigate through that "heap" in an efficient manner.

(I like to think of a clustered index as being like a telephone book, usually with each page filled to capacity. If you insert a name, all the subsequent names must be moved to make room for the new entry. You can use FILL_FACTOR on the clustered index to leave room for new entries on each "page" of the table. So there are plenty of I/O operations involved, just to insert a single row. But if the telephone book is replaced by putting all the entries in a "heap", where the indexes are like having a search-engine's knowledge of the data, you can enter any set of indexed values to find rather quickly the "entries of interest." With the non-clustered index, you end up with a "pointer(s)" to the data row(s) of interest; with the clustered index, when you get "to the bottom" of the index, you find the row of data itself, instead of merely a pointer to the row of data.)
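
(A fill factor is set when the index is created or rebuilt; as a sketch, with made-up table, column, and index names:)

-- Rebuild the clustered index, leaving 20% free space on each page for future inserts
CREATE CLUSTERED INDEX IX_Example
    ON dbo.SomeLargeTable (SomeKeyColumn)
    WITH FILLFACTOR = 80, DROP_EXISTING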

IMHO the main benefits for clustered indexes arise with tables that are "static," i.e., tables that experience no INSERT, UPDATE, or DELETE actions.

I have plenty of moderately large (millions of rows, as opposed to tens of millions, or billions of rows) tables for which I have only non-clustered indexes. The data access is plenty fast with non-clustered indexes. (I created the tables and indexes long before I understood anything about clustered versus non-clustered indexes -- so I just used "create index" without any special qualifiers.)

Because of the way you are performing your JOINs, with only a single column at a time, I would recommend the separate indexes for each column that you mention. The size of the index is related to the number of columns in the index, and the size of the data in each column.

Yes, it will take a bit more space to have five indexes with one column each, rather than one index with all five columns, but the five-column index is only useful when your JOINs include the "first" column listed in the index; it is not useful if you only use the "second and third" columns of the index, for instance; it is only moderately useful if you only use the "first and third" columns of the index. (I say "moderately useful" in that last instance because the index can only benefit the query with the "first" column; it cannot benefit with the "third" column because the "second" column gets in the way. Maybe think of an index on street addresses, wherein the "first" column is the alphanumeric "street number", the second column is the "street name", and the third column is the "city". To search for "street name" and "city" cannot use the index, so a complete table scan must be performed. To search for "street number", e.g., with a LIKE condition, works fine; but, to search for "street number" and "city" requires a complete scan of the portion of the table that matches the "street number" condition, since all the "street name" entries come between "street number" and "city", preventing the "city" aspect of the index from being used.) You probably know all this already, so please forgive my elaboration on this issue.
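
As a concrete sketch of that analogy (table and column names are made up):

CREATE INDEX IX_Address ON dbo.Addresses (StreetNumber, StreetName, City)

-- Can seek: the leading column of the index is specified
SELECT * FROM dbo.Addresses WHERE StreetNumber = '221'

-- Cannot seek: the leading column is missing, so the table (or the whole index) must be scanned
SELECT * FROM dbo.Addresses WHERE StreetName = 'Baker Street' AND City = 'London'

-- Seeks on StreetNumber only; the City condition is checked row by row within that range
SELECT * FROM dbo.Addresses WHERE StreetNumber = '221' AND City = 'London'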

Those are the suggestions I would try, anyway, were I in your position.

Dan





Re: Inner join performance

Ennor

Well, the plans are definitely not the best ones I've ever seen...

What indexes do you have on the D table?
On the R table?
For what reason have you included the P, T and F tables in the query?

Also, I cannot agree with DanR1 when it comes to converting your tables to heaps. You may wish to experiment with it, but I doubt you will like the results. Be doubly careful.




Re: Inner join performance

Andy Fordyce

The D table has a clustered index on [RegionId] ASC, [FactId] ASC, [TimeId] ASC, [ProductId] ASC. The R table has a clustered index on the ID column. I know this is not ideal and this is what I am attempting to sort out. The other tables are a definite requirement and we need to select from them. The thing I am really attempting to understand is why the clustered index on the D table was using the fast plan and then, after updates, it caused a slowdown. The D table is very large (100GB+), so a non-clustered index on the D table would be very costly.




Re: Inner join performance

DanR1

Andy,

Your 100GB table is larger than any with which I have worked.

The RegionID as the first entry in your clustered index goes well with your JOIN on "R".

Here is an interesting discussion on the merits of having a clustered index:

http://www.sql-server-performance.com/clustered_indexes.asp

And here is another:

http://www.informit.com/guides/printerfriendly.asp?g=sqlserver&seqNum=200&rl=1

And a technet Q/A session:

http://www.microsoft.com/technet/community/chats/trans/sql/sql0327.mspx

Because "D" has a clustered index on RegionId, it seems that should help the JOIN with "R". But a full table scan of "R" is necessary to find the entries where R.Batch = 17732, since you have no index on R.Batch. Is "R" a large table (How many rows are in "R" )

Are there indexes on the other tables: P, T, and F?

Since non-clustered indexes do not contain the data rows, only keys and row locators, the size of an index on a single column of "D" would be roughly the bytes needed for the column value plus the row locator, multiplied by the number of rows in the table, plus some overhead for the index pages.
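
As a purely illustrative, back-of-the-envelope figure: if the indexed column is a 4-byte int and the row locator is your four-column clustering key (16 bytes if those columns are all ints), that is roughly 20 bytes per row, so a billion-row table would need something on the order of 20 GB for one single-column index, before page overhead.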

As you well know, often database performance is a balance of using storage to create indexes that speed some processing steps, while sometimes delaying others (due to index maintenance functions).

Dan





Re: Inner join performance

Ennor

Wow, 100GB+ table...

Anyone will tell you that you need an index on the R.Batch field. That's simply obvious: you have a search criterion on it. But about the D table... I definitely recommend you read something about index planning and tuning, because with tables this large it may take a day or so just to drop an index, let alone create a new one (and it may be worth switching to single-user mode before doing this). So you had better understand yourself what you are going to do before you do it, agreed?
As a general rule, I highly recommend you use a simpler clustered index. Four columns, even if they are ints, will take a tremendous amount of disk space and, more importantly, make very little sense as an index key. Do you know, for example, that the statistics created for composite indexes are built only on the first column's distribution data? There are many other complications specific to composite indexes as well, which makes them a very particular and questionable solution. I suppose it is better here to have a simple one-column clustered index and separate non-clustered indexes on all the other fields in this table that you use as join criteria.
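
(You can see the single-column histogram for yourself with something like this, assuming the dbo schema; the second argument is whatever your clustered index on D is actually named:)

USE TestDB
-- The histogram in the output only describes the leading column of the index (RegionId)
DBCC SHOW_STATISTICS ('dbo.Data', 'name_of_clustered_index_on_D')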

But what field to choose for a clustered index... that's the question you have to answer yourself, mainly because it depends on the business logic of your DB and the applications that work with it. With the information I currently have I cannot make any guess in this matter.




Re: Inner join performance

Andy Fordyce

Many thanks for the info Dan, it is very helpful. The R, P, T and F tables are small, with a maximum of 1500 rows. I think I will look into implementing non-clustered indexes to get this thing moving. The thing that still puzzles me is why I was getting good performance with the clustered index on D and then not. Bizarre.




Re: Inner join performance

Chris Howarth

How many rows does your query return?

Could you set the following options, then run the query and post the output message text back to the forum?

SET STATISTICS IO ON
SET STATISTICS TIME ON

Chris