Saturday, January 31, 2009

Adventures in LINQ: Deferred execution

I recently started a new project in ASP.NET 3.5. It actually a total re-write of an old application I created some time ago. I haven't done any significant web development since the official release of 3.5 (hard to believe) so I needed to get back on the horse and get up to speed. One of the goals is to get familiar with LINQ. One of my co-workers used it for a project we architected and he said that LINQ saved boat loads of effort so I figured it was high time for me to get my act together.

One of the first things I've discovered with LINQ is how chatty it is with the database server. After starting to experiment with LINQ I turned on the SQL Server Profiler so I could watch what LINQ was doing with the database. I noticed far more queries being executed than I'd have expected. Way more than I would execute myself if I were programming the traditional way with my own explicitly defined queries. A key source of the excessive query traffic is the concept of deferred execution.

Here's how it plays out.  I have my digital photos in a file repository on disk.  I'm building a database driven tagging system for them.  In my photo library database, I have a table called "Photo".  This abstracts out to the "Photos" class in the LINQ to SQL classes DBML.  I need to scan the file repository for new photos and check the database to see if they already exist.  Simple enough.  For each photo file I encounter I want to check the database.  It stands to reason that I shouldn't query the DB for each individual file, but rather preload the photo table to an in memory collection then look through the collection for each photo that's encountered.  So I approach it this way:

01 string[] strFiles;
02 strFiles = Directory.GetFiles(dirPath, "*.*");
03
04 var lstPhotos = from p in _db.Photos select p;
05 for(int i = 0; i < strFiles.Length; i++)
06 {
07 intPhotoCount = (from p in lstPhotos
08 where p.RelativePath == strFiles[i]
09 select p).Count();
10 if(intPhotoCount == 0)
11 {
12 // do create photo stuff
13 }
14 }

What I expected to see was a single query fired against the database at line 04.  But I didn't.  Instead, I got one query for each iteration of the loop, fired at line 07-09.  This query, as suggested by the LINQ, contained an explicit 'where' clause for the RelativePath value test.  

exec sp_executesql N'SELECT COUNT(*) AS [value]
FROM [dbo].[Photo] AS [t0]
WHERE [t0].[RelativePath] = @p0',
N'@p0 varchar(30)',@p0='.JPG'

With many thousands of photos, this isn't good.  So I thought that maybe if I forced the data context into delivering a full set the collection would preload as I planned.  I added this line immediately after the initial LINQ (after line 04):

lstPhotos.GetEnumerator();

When I ran this, I saw this SQL execute:

SELECT [t0].[PhotoId], [t0].[CollectionId]
, [t0].[RelativePath], [t0].[Title]
, [t0].[Description], [t0].[Height], [t0].[Width]
FROM [dbo].[Photo] AS [t0]

Great! However, it's immediate followed up by the same loop specific queries. All I did was add a superfluous query to the noise. The GetEnumerator() call got me thinking that I could use the actual resulting enumerator. However, my original approach was: for each file encountered, check the existing photo list from the database to see if I've already captured it. In order to use the enumerator, I have to reverse the tactic because we can't find a single item using an enumerator, only go through the enumerator's set front to back.

So let's reverse the approach. Instead of iterating through the file list, we'll go through the LINQ result set. Then I'll remove the file item that matches. The result will be the files that don't yet exist in the database. We can then add those.

01 List lstFiles = new List(strFiles);
02 var lstPhotos = from p in _db.Photos select p;
03 foreach(Photo photo in lstPhotos)
04 {
05 lstFiles.Remove(photo.RelativePath);
06 }
07 if(lstFiles.Count > 0)
08 {
09 for(int i = 0; i < lstFiles.Count; i++)
10 {
11 // do new photo stuff
12 }
13 }

First I create a generic collection out of the original file name string array (return type of Directory.GetFiles()). After the LINQ for the complete result set from Photos, I do a foreach, which calls .GetEnumerator() on lstPhotos. This results in a single SQL query for the entire result set. Then I iterate through the results, removing any matches. Last, I do the actual creation of the remaining items.

I find more and more with .NET that I must work as though I'm carving an ice sculpture. Instead of trying to build something up with little cubes it's often easier to start with a single large block and take away what you don't need. This is a surprisingly applicable analogy. While you need a lot more space (memory) to start with the large block of ice (data), once you've chipped away what you don't need, it's easily cleaned up (garbage collection). The alternative is a process that takes considerably longer and may ultimately be more fragile in the end.

It's amazing how often thinking backwards can lead to a better solution.