Monday, August 29, 2011

Streaming SQL Server VARBINARY(MAX) In Chunks With ADO.NET

A few years ago it was unthinkable to store BLOB data, like large documents, directly inside of a database. The most common way to combine large objects with a database was to save the data to the file system and store only the UNC path in the database. Today, storing BLOB data directly inside of a database has become a common requirement in many projects. This way of saving binary data brings some advantages and drawbacks to think about.

Probably the most important advantage is security. If we save BLOB data in the file system we not only have to manage security on database level but also have to restrict access to the file system to avoid unauthorized access. Another advantage is integrity. A UNC file path stored in a database easily becomes invalid if someone deletes the file. This can happen through malware attacks, but also through bugs in software that changes one side but not the other. A file name doesn't provide any way to enforce referential integrity, and it can be tricky to keep both sides synchronized, especially when working with transactions.

On the other hand there is one drawback in particular when storing BLOBs inside of the database. If the data is really large, say larger than 10MB or 100MB, it becomes tricky to save and read it without allocating huge amounts of memory on server and client side. Where smaller data can easily be pushed or pulled in one chunk, we don't want the client or server to handle 100MB as one big block of memory.

Missing Class in ADO.NET Framework?

On SQL Server 2008, when working with VARBINARY(MAX) based on FILESTREAM, we can use the ADO.NET SqlFileStream class to stream data from and to SQL Server. Unfortunately this class does not work if the data is stored inside of a regular VARBINARY(MAX) column, and not everyone wants to activate FILESTREAM - among other reasons, because of the integrity problem.
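For comparison, here is a minimal sketch of the FILESTREAM approach, assuming a hypothetical TestBlob table whose Data column is a FILESTREAM VARBINARY(MAX) with an INT primary key Id; SqlFileStream lives in System.Data.SqlTypes and requires an active transaction:
using (var cn = new SqlConnection(cnStr)) {
   cn.Open();
   using (SqlTransaction tran = cn.BeginTransaction()) {
      string path;
      byte[] txContext;
      // PathName() and GET_FILESTREAM_TRANSACTION_CONTEXT() only work for
      // FILESTREAM columns - which is why this fails for plain VARBINARY(MAX)
      using (var cmd = new SqlCommand(
            "SELECT Data.PathName(), GET_FILESTREAM_TRANSACTION_CONTEXT() "
            + "FROM TestBlob WHERE Id = @id", cn, tran)) {
         cmd.Parameters.Add("@id", SqlDbType.Int).Value = 1;
         using (var reader = cmd.ExecuteReader()) {
            reader.Read();
            path = reader.GetString(0);
            txContext = (byte[])reader[1];
         }
      }
      // stream the BLOB in small chunks over the Win32 file system API
      using (var stream = new SqlFileStream(path, txContext, FileAccess.Read)) {
         var buffer = new byte[4096];
         int read;
         while ((read = stream.Read(buffer, 0, buffer.Length)) > 0) {
            // process the chunk
         }
      }
      tran.Commit();
   }
}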

SqlBinaryData Class

To be able to stream data stored in a VARBINARY(MAX) column, I wrote a helper class called SqlBinaryData that provides a simple but powerful interface.
public class SqlBinaryData : IBinaryData {
   //...
   
   /// <summary>
   /// Creates a new readable instance of a <see cref="Stream"/> for the BLOB column
   /// </summary>
   /// <returns>A readable stream.</returns>
   public Stream OpenRead() {
      return new SqlBinaryReader( /* ... */ );
   }

   /// <summary>
   /// Creates a new writable instance of a <see cref="Stream"/> for the BLOB column
   /// </summary>
   /// <returns>A writable stream.</returns>
   public Stream OpenWrite(bool append) {
      return new SqlBinaryWriter( /* ... */ );
   }

   // ...
}

Since the class works with four additional internal classes (about 700 lines of code), I will only focus on the most important parts here. A ZIP archive containing the whole source code and my unit tests can be downloaded at the end of this post.

The OpenRead method creates and returns an instance of an internal SqlBinaryReader class that implements a readable version of System.IO.Stream. The OpenWrite method creates and returns an instance of an internal SqlBinaryWriter class that implements a writable version of System.IO.Stream. The other two classes, SqlBinaryInfo and SqlBinaryMetaData, are primarily used to provide database connections, guard against SQL injection and provide metadata.

Metadata Evaluation And Caching

Metadata, like the primary key column and table names, are retrieved from SQL Server when first needed, which is when a stream is first read or written. Until then, no database connection is opened, to avoid unnecessary resource allocation and database utilization.

All metadata are cached for the current application domain. If an instance of SqlBinaryData needs database metadata that have already been retrieved for the same table, binary column and connection string, this information is reused from an internal cache. If metadata for an unknown binary column are requested from two threads at the same time, one of the threads may be suspended until the metadata have been loaded by the other one. This will likely not cause any problems, but it is worth knowing.
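A minimal sketch of how such a cache can look; the names here are assumptions, the actual implementation is part of the download at the end of this post:
static class MetaDataCache {
   private static readonly object _lock = new object();
   private static readonly Dictionary<string, SqlBinaryMetaData> _cache =
      new Dictionary<string, SqlBinaryMetaData>();

   public static SqlBinaryMetaData GetOrLoad(string connectionString,
         string table, string column, Func<SqlBinaryMetaData> loader) {
      string key = connectionString + "|" + table + "|" + column;
      lock (_lock) {
         SqlBinaryMetaData metaData;
         if (!_cache.TryGetValue(key, out metaData)) {
            // a second thread requesting the same key waits here until
            // the first one has loaded the metadata from the server
            metaData = loader();
            _cache.Add(key, metaData);
         }
         return metaData;
      }
   }

   public static void Clear() {
      lock (_lock) {
         _cache.Clear();
      }
   }
}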

If the database schema might change over the lifetime of a process and the primary key column is changed by name or type, the internal cache can be cleared by calling the static method SqlBinaryData.ClearMetaDataCache(). A reading or writing stream that is already active while the cache is cleared will keep working with its previously requested metadata, which might cause an error; any newly created stream, even from an already existing SqlBinaryData instance, causes a reload of the metadata from the database.

Metadata are retrieved by utilizing the SQL Server INFORMATION_SCHEMA views, so they should be accessible with the most common database user rights.

Table and schema name are taken, and thereby validated, from the INFORMATION_SCHEMA.TABLES view.
-- If table schema is specified
SELECT QUOTENAME(TABLE_NAME), QUOTENAME(TABLE_SCHEMA), TABLE_SCHEMA
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = @tableName
   AND TABLE_SCHEMA = @schemaName

-- If table schema is not specified
SELECT QUOTENAME(TABLE_NAME), QUOTENAME(TABLE_SCHEMA), TABLE_SCHEMA
FROM INFORMATION_SCHEMA.TABLES
WHERE TABLE_NAME = @tableName
   AND TABLE_SCHEMA = OBJECT_SCHEMA_NAME(OBJECT_ID(@tableName))
Primary key column information is taken from the INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE view.
SELECT QUOTENAME(COLUMN_NAME)
FROM INFORMATION_SCHEMA.CONSTRAINT_COLUMN_USAGE
WHERE TABLE_NAME = @tableName AND TABLE_SCHEMA = @schemaName
AND OBJECTPROPERTY(OBJECT_ID(QUOTENAME(CONSTRAINT_SCHEMA)
         + '.' + QUOTENAME(CONSTRAINT_NAME))
      ,'IsPrimaryKey') = 1
If no primary key is found, an exception is thrown. If the primary key consists of more than one column, an exception is thrown as well, since this version of the SqlBinaryData class does not support composite primary keys.

Binary column information is taken from the INFORMATION_SCHEMA.COLUMNS view.
SELECT QUOTENAME(COLUMN_NAME), DATA_TYPE, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @tableName
   AND TABLE_SCHEMA = @schemaName
   AND COLUMN_NAME = @columnName
If the returned column is not of type VARBINARY(MAX) an exception will be thrown.

SQL Injection

To avoid the possibility of SQL injection, all provided table and column names are quoted by utilizing the SQL Server QUOTENAME function, as shown above. All values, like the value of the primary key to stream binary data from or to, are passed to SQL Server as SqlParameter objects to avoid SQL injection and to let SQL Server reuse cached execution plans.

SqlBinaryReader Class

This internal class is responsible for streaming read access to the BLOB. As already noted, it implements System.IO.Stream to provide well known standard behavior.
class SqlBinaryReader : Stream { //...
Its most important public method is the Read method, which copies data into a specified array:
public override int Read(byte[] buffer, int offset, int count) {
   // first read
   if (_buffer == null)
      ReadChunk();

   int done = 0;

   while (count != done) {
      // read buffered data into provided buffer
      done += ReadInternal(buffer, offset + done, count - done);
      // end of DB data reached
      if (_buffer.Length < _info.BufferSize)
         break;
      // read next chunk from database if needed
      if (done < count)
         ReadChunk();
   }

   return done;
}
The ReadInternal method copies data out of the internal buffer into the specified one; a sketch of it follows below. The ReadChunk method handles the database access and is shown here:
private void ReadChunk() {
   SqlBinaryMetaData metaData = GetMetaData();
   // create an internal database connection if not yet available
   if (_cn == null)
      _cn = _info.CreateConnection(GetMetaData().ConnectionString);

   // create an internal data reader
   if (_reader == null) {
      string sql =
         string.Format("SELECT {0} FROM {1} WHERE {2} = @pk",
                        metaData.BinaryColumn,
                        metaData.QualifiedTableName,
                        metaData.PkColumn);
      using (var cmd = new SqlCommand(sql, _cn)) {
         cmd.Parameters.Add(_info.CreatePkParam());
         // open the reader with sequential access behavior to enable
         // streaming data from the database
         _reader = cmd.ExecuteReader(CommandBehavior.SequentialAccess);
         _reader.Read();
      }
   }

   int read = (int)_reader.GetBytes(0, _position, _buffer, 0, _buffer.Length);
   if (read != _buffer.Length)
      Array.Resize(ref _buffer, read);
   _offset = 0;
}
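ReadInternal itself is not part of the listings here; roughly sketched (a simplification of the real implementation in the download), it only copies buffered bytes into the caller's array and advances the position:
// sketch: copy as many bytes as are currently buffered into the
// caller's array and advance the stream position
private int ReadInternal(byte[] buffer, int offset, int count) {
   int chunk = Math.Min(_buffer.Length - _offset, count);
   Array.Copy(_buffer, _offset, buffer, offset, chunk);
   _offset += chunk;
   _position += chunk;
   return chunk;
}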
The internal instance of SqlDataReader is created with the option CommandBehavior.SequentialAccess, which causes ADO.NET to stream data out of SQL Server instead of pulling all data in one chunk. The reader and its parent connection are released when the stream instance is disposed.

Data will always be streamed with the specified buffer size, even if the caller requests a larger count of bytes in one call. This avoids too high a memory utilization on server side. So it is still possible to get 10MB of data in one single call of the Read method, even if this might cause 10 database calls when a buffer size of 1MB is specified.

SqlBinaryWriter Class

This internal class is responsible for streaming write access to a VARBINARY(MAX) column of a table row. Like the SqlBinaryReader, it implements System.IO.Stream.
class SqlBinaryWriter : Stream {
The two important public methods of the write stream are the Write and Flush methods.

The Write method copies the provided data in chunks into the internal buffer. Whenever the internal buffer is full, it calls the Flush method, which is responsible for writing the current chunk of data to the server.
public override void Write(byte[] buffer, int offset, int count) {
   if (_failedState)
      throw new InvalidOperationException("Stream is in a failed state");

   int done = 0;
   while (done != count) {
      int chunk = Math.Min(_internalBuffer.Length - _internalOffset, count - done);
      // push a chunk of bytes into the internal buffer
      Array.Copy(buffer, offset + done, _internalBuffer, _internalOffset, chunk);
      _internalOffset += chunk;
      // if internal buffer is full, flush to database
      if (_internalOffset == _internalBuffer.Length)
         Flush();
      done += chunk;
   }
}
To avoid unneeded memory reallocation on server side, new data are appended to the database field by using the VARBINARY(MAX).WRITE method (see UPDATE (Transact-SQL)), which performs a partial update of the existing data instead of a complete reallocation of all data.

public override void Flush() {
   if (_internalOffset == 0)
      return;
   if (_failedState)
      return;

   SqlBinaryMetaData metaData = GetMetaData();

   using (SqlConnection cn = _info.CreateConnection(metaData.ConnectionString))
   using (var tran = cn.BeginTransaction()) {
      try {
         // handle NULL value and "append" configuration
         PrepareValue(cn, tran);
         // UPDATE SchemaName.TableName 
         // SET BinaryColumn.WRITE(@buffer, @offset, @count) 
         // WHERE PkColumn = @pk
         string sql =
            string.Format("UPDATE {0} SET {1}.WRITE(@buffer, @offset, @count) WHERE {2} = @pk",
                          metaData.QualifiedTableName,
                          metaData.BinaryColumn,
                          metaData.PkColumn);
         using (var cmd = new SqlCommand(sql, cn)) {
            cmd.Transaction = tran;

            var bufferParam = cmd.Parameters.Add("@buffer", SqlDbType.VarBinary, _info.BufferSize);
            var offsetParam = cmd.Parameters.Add("@offset", SqlDbType.Int);
            var countParam = cmd.Parameters.Add("@count", SqlDbType.Int);
            cmd.Parameters.Add(_info.CreatePkParam());

            byte[] buffer;
            if (_internalOffset == _internalBuffer.Length)
               buffer = _internalBuffer;
            else {
               // avoid bumping not needed data over network
               buffer = new byte[_internalOffset];
               Array.Copy(_internalBuffer, buffer, _internalOffset);
            }

            bufferParam.Value = buffer;
            // VARBINARY(MAX).WRITE works with a zero based index
            offsetParam.Value = _position;
            countParam.Value = _internalOffset;
            // write chunk
            int affected = cmd.ExecuteNonQuery();
            _info.AssertOneRowAffected(affected);
            _position += _internalOffset;
            _internalOffset = 0;
         }
         tran.Commit();
      }
      catch {
         _failedState = true;
         tran.Rollback();
         throw;
      }
   }
}
The PrepareValue method, called from Flush, prepares the database field on the first call. If the stream was created with the "append" option set to false, it resets any existing data to the initial value 0x. If the stream was created with "append" true, it determines the current length of the data in the binary field, or sets the field to 0x if its value is currently NULL.
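Roughly sketched, PrepareValue could look like this; the exact SQL and the _prepared/_append fields are assumptions based on the description above, the real code is part of the download:
private void PrepareValue(SqlConnection cn, SqlTransaction tran) {
   if (_prepared)
      return;

   SqlBinaryMetaData metaData = GetMetaData();
   string sql = _append
      // append: replace a NULL value with 0x and continue at the current length
      ? string.Format("UPDATE {0} SET {1} = ISNULL({1}, 0x) WHERE {2} = @pk; "
                    + "SELECT DATALENGTH({1}) FROM {0} WHERE {2} = @pk",
                      metaData.QualifiedTableName,
                      metaData.BinaryColumn,
                      metaData.PkColumn)
      // overwrite: reset any existing value to the initial value 0x
      : string.Format("UPDATE {0} SET {1} = 0x WHERE {2} = @pk; SELECT 0",
                      metaData.QualifiedTableName,
                      metaData.BinaryColumn,
                      metaData.PkColumn);

   using (var cmd = new SqlCommand(sql, cn)) {
      cmd.Transaction = tran;
      cmd.Parameters.Add(_info.CreatePkParam());
      // start position for the following .WRITE calls
      _position = Convert.ToInt64(cmd.ExecuteScalar());
   }
   _prepared = true;
}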

How To Use

Here is an example that shows how to use the class for streaming writes and reads.
SqlBinaryData data = 
   SqlBinaryData.CreateLongPrimaryKey(ConnectionString, "TestBlob", "Data", 3L, 5);

byte[] expected = Guid.NewGuid().ToByteArray();
using (var writer = data.OpenWrite(false)) {
   writer.Write(expected, 0, 4);
   writer.Write(expected, 4, expected.Length - 4);
}

byte[] actual = new byte[expected.Length];
using (var reader = data.OpenRead()) {
   reader.Read(actual, 0, 3);
   reader.Read(actual, 3, actual.Length - 3);
}

Assert.IsTrue(expected.SequenceEqual(actual));

To create a new instance of the stream-providing SqlBinaryData class you can use its constructor. The parameter pkParam specifies the primary key column.
public SqlBinaryData(string connectionString, string tableName, string tableSchema,
                     string binaryColumn, SqlParameter pkParam, object pkValue, 
                     int bufferSize);
In case of an INT, BIGINT or UNIQUEIDENTIFIER primary key column, you can also use one of the static factory methods of SqlBinaryData (each one with and without specifying a table schema name):
public static SqlBinaryData CreateIntPrimaryKey(string connectionString, 
      string tableName, string tableSchema, string binaryColumn, int pkValue, int bufferSize);

public static SqlBinaryData CreateIntPrimaryKey(string connectionString, 
      string tableName, string binaryColumn, int pkValue, int bufferSize);

public static SqlBinaryData CreateLongPrimaryKey(string connectionString, 
      string tableName, string tableSchema, string binaryColumn, long pkValue, int bufferSize);

public static SqlBinaryData CreateLongPrimaryKey(string connectionString, 
      string tableName, string binaryColumn, long pkValue, int bufferSize);

public static SqlBinaryData CreateGuidPrimaryKey(string connectionString, 
      string tableName, string binaryColumn, Guid pkValue, int bufferSize);

public static SqlBinaryData CreateGuidPrimaryKey(string connectionString, 
      string tableName, string tableSchema, string binaryColumn, Guid pkValue, int bufferSize);

Transactions

Since the classes create their database connections internally, you cannot directly use a SqlTransaction to scope your DML operations. However, since the ADO.NET data provider for SQL Server supports transaction scoping, it is still possible to ensure isolated processing by using System.Transactions.TransactionScope.
using (TransactionScope tran = new TransactionScope()) {
   using (var cn = CreateConnection()) {
      // do stuff
   }

   var binaryData = CreateBinary(10L, 16);
   using (var stream = binaryData.OpenWrite(false)) {
      byte[] toWrite = Guid.NewGuid().ToByteArray();
      stream.Write(toWrite, 0, toWrite.Length);
   }

   using (var cn = CreateConnection()) {
      // do more stuff
   }
}

Server Side Resource Allocation

Since we can specify the exact buffer size to work with, and chunks are always taken in this size, we know that we are fine on client side. The other side to look at is what resources are allocated on server side. We will examine this by imitating the class's behavior in SQL Server Management Studio and checking the results in SQL Server Profiler.

Here is the setup for the following tests.
SET NOCOUNT ON;

IF (OBJECT_ID('TestBlob') IS NOT NULL)
   DROP TABLE TestBlob;

CREATE TABLE TestBlob (
   Id INT NOT NULL PRIMARY KEY CLUSTERED
   ,Data VARBINARY(MAX)
);

INSERT INTO TestBlob VALUES (1, 0x);
INSERT INTO TestBlob VALUES (2, 0x);
GO
First, let's UPDATE a row with 50MB of binary data in one single batch.
-- one big update with 50MB of data
DECLARE @data VARBINARY(MAX) = 
   CONVERT(VARBINARY(MAX), REPLICATE(CONVERT(VARBINARY(MAX), NEWID()), 65536 * 50));

UPDATE TestBlob SET Data = @data WHERE Id = 1;
A look into SQL Server Profiler shows, as expected, a huge resource allocation.

Now, let's get the same amount of data into SQL Server by using the WRITE method and sending 10 chunks of 5MB each. To keep the test valid we have to remember the current position to write at. To keep this information available across batches we can use CONTEXT_INFO(), which allows storing up to 128 bytes of custom information for the current session.

SET CONTEXT_INFO 0x; -- clear the context
GO
-- 10 updates, each with 5MB
DECLARE @offset INT = 
   ISNULL(CONVERT(INT, SUBSTRING(CONTEXT_INFO(), 1, 4)), 0);
--PRINT @offset;
DECLARE @data VARBINARY(MAX) = 
   CONVERT(VARBINARY(MAX), REPLICATE(CONVERT(VARBINARY(MAX), NEWID()), 65536 * 5));
DECLARE @data_length INT = DATALENGTH(@data)

UPDATE TestBlob SET Data.WRITE(@data, @offset, @data_length) WHERE Id = 2;
SET @offset += @data_length;

SET CONTEXT_INFO @offset;
GO 10 -- count of batch execution loops
Another look into Profiler shows that we get 10 statements, each with a really nice resource allocation.

Unfortunately I cannot provide a valid test result for read operations using the SqlDataReader.GetBytes method at the moment. This is caused by an issue of SQL Server Profiler when trying to trace BLOB data (see the Microsoft Connect issue MSFT:EDW - Profiler displays "Trace Skipped Records" for large (>530Kb) batch). I don't want to show test results with too small binary data since the measuring inaccuracy might be too high. If you can tell me a way to get valid test results for this, I would be happy if you let me know.

Restrictions

The current version of this class supports only single column primary keys. An exception will be thrown if a primary key consists of more than one column. Let me know if there is a need for a version that supports composite primary keys.

The table containing the binary data is required to have a primary key; the class doesn't work with heap tables. An exception is thrown if an accessed table does not have a primary key.

The class supports only columns of type VARBINARY(MAX). If there is a need for a VARCHAR(MAX)/NVARCHAR(MAX) based version, let me know.

The class does not handle possible concurrent client sessions, to avoid unwanted database locks. If data accessed by a current instance of a read or write stream are modified by another user, the processed data will most likely become corrupted on client or even server side. To secure write operations, use a TransactionScope surrounding the writing stream.

IBinaryData Interface

The SqlBinaryData class implements an interface called IBinaryData. This interface is not required for using the class, but I use it for unit testing purposes in business layers of a project, where it allows me to work with files of a directory instead of data from a database. Feel free to remove it.

Attachments

Here you can find a ZIP archive containing all classes of SqlBinaryData and an NUnit based test class.

Tuesday, August 16, 2011

ADO.NET Connection Pooling, Internals and Issues

Description

The creation of a new database connection is a relatively expensive process on server side. The server needs to allocate new memory, create a new session, allocate all needed client information, authenticate the client and so on. On the other side, it does not make sense to keep connections open for the whole lifetime of an application, since many concurrent connections cause a higher session management overhead on server side.

To handle those two issues, ADO.NET uses connection pooling. This means the well known SqlConnection object from the .NET framework is just a wrapper for the real, internal database connection. Whenever a new SqlConnection is opened, it asks the connection pool for an existing, currently unused internal connection. If an existing, free connection is available, ADO.NET won't create a new database connection but will reuse the existing one. As soon as the connection is not needed any more it is sent back into the connection pool. If a connection is not used for a longer time it is automatically closed by the ADO.NET connection pooler, to release resources on SQL Server. To determine whether an internal connection can be used for a current request by a SqlConnection, ADO.NET compares several (not all) parts of the connection string. This includes authentication information, database name, transaction context enlisting configuration and so forth.

This blog post focuses on ADO.NET SqlClient and SQL Server. Some of the information might be different when working with other database servers or other data providers.

Myth Close and Dispose

Many books and web resources preach that it is important to always Close and Dispose a connection, to ensure that the internal connection will be sent back to the connection pool. Until .NET 3.5 even MSDN stated this, but this was fixed in version 4.0. Both ways, calling the Close method or disposing the SqlConnection, actually send the connection back to the pool. Since the using-statement provides a language integrated way to dispose a component as soon as it goes out of scope, I usually prefer it over calling the Close method in a try-catch-finally block.

Here's a little sample that shows that both ways result in correct connection pooling:
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;";
string sql = "SELECT connect_time FROM sys.dm_exec_connections WHERE session_id = @@SPID";
SqlConnection cn1 = new SqlConnection(cnStr);

// close the connection but don't dispose it
cn1.Open();
using (SqlCommand cmd = new SqlCommand(sql, cn1)) {
   DateTime loginTime = (DateTime)cmd.ExecuteScalar();
   Console.WriteLine("New connection: {0}", loginTime.ToString("HH:mm:ss.fff"));
}
cn1.Close();

// check that the internal connection was pooled after Close only
Thread.Sleep(500);
using (SqlConnection cn = new SqlConnection(cnStr)) {
   cn.Open();
   using (SqlCommand cmd = new SqlCommand(sql, cn)) {
      DateTime loginTime = (DateTime)cmd.ExecuteScalar();
      Console.WriteLine("After Close only: {0}", loginTime.ToString("HH:mm:ss.fff"));
   }
}

// check that the internal connection was pooled after disposing only
Thread.Sleep(500);
using (SqlConnection cn = new SqlConnection(cnStr)) {
   cn.Open();
   using (SqlCommand cmd = new SqlCommand(sql, cn)) {
      DateTime loginTime = (DateTime)cmd.ExecuteScalar();
      Console.WriteLine("After disposing only: {0}", loginTime.ToString("HH:mm:ss.fff"));
   }
}

cn1.Dispose();
Console.ReadLine();
By returning the connect_time column from sys.dm_exec_connections for our current session id we retrieve the creation time of the connection from SQL Server.

As we see, in all cases we reuse the same internal connection to SQL Server.

How a Pooled Connection becomes Reset

Since ADO.NET does not know what happened in a previous session of a pooled connection, it has to reset the connection before it can safely be reused when it is handed to a new instance of a SqlConnection. When a connection is reset, all created temporary objects are dropped, all previously allocated resources are freed and the initially connected database is selected again - if it was changed in the previous session. This resetting is done by calling the SQL Server procedure sp_reset_connection.

This can cause some issues. The ADO.NET connection pooler does not reset the connection when it is sent back to the pool, but when it is returned to a newly opening SqlConnection, as shown in the next sections.

In addition, the transaction isolation level is not reset if you changed it in a previous session. If you change the transaction isolation level in some of your sessions, you will need to manually reset it whenever you get a pooled connection. At the moment this behavior is by design, due to backward compatibility (see the Microsoft Connect issue sp_reset_connection doesn't reset isolation level). (Thanks to Greg who suggested this!)
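A little sketch of such a manual reset, right after acquiring the pooled connection:
using (var cn = new SqlConnection(cnStr)) {
   cn.Open();
   // sp_reset_connection does not reset the isolation level, so we
   // explicitly set it back to the default after opening
   using (var cmd = new SqlCommand(
         "SET TRANSACTION ISOLATION LEVEL READ COMMITTED;", cn)) {
      cmd.ExecuteNonQuery();
   }
   // ... work with the connection ...
}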

Temporary Tables and Named Constraints

When working with temporary tables in SQL Server there is still an issue: names of temporary tables only need to be unique within the current connection, but names of constraints need to be unique across all current connections. Since SQL Server 2005 it is possible to define constraints without specifying a name for them. Previous versions did not support this feature, and the syntax to specify a name for a table constraint is, of course, still valid. When executing the following SQL statement in two concurrent connections you get an error message:

CREATE TABLE #t1 (
   Id INT NOT NULL CONSTRAINT PK_TEMP PRIMARY KEY CLUSTERED
)
Only the first connection is able to create the temp table; the second connection gets the following error:
Msg 2714, Level 16, State 5, Line 1
There is already an object named 'PK_TEMP' in the database.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint. See previous errors.

This also affects ADO.NET connection pooling. Say we've got two different processes; the first creates a new entry in a database and the second one listens for those entries to process them. When the first process finishes its work without an explicit DROP of the temporary table, the connection stays open and the temporary table will stay in the database until the connection is reused. If the second process tries to create the same temporary table it will receive the same exception.

Global Temporary Tables

The same problem as described for constraints of usual temporary tables applies to global temporary tables. Their names need to be unique across all concurrent SQL Server connections. If one process creates a global temp table and does not explicitly drop it before closing/disposing the SqlConnection, the table will stay on the server until the pooled connection is reused by the process. This behavior can cause confusing errors in a production system.

Release of Temporary Allocated Data

Of course, since temporary tables stay available until a connection is reused, this also implies that all data stored in those tables stays allocated until the process exits or reuses the connection.

If we have a process that creates a temporary table and fills it with hundreds of thousands of rows, that data will stay in tempdb for probably longer than expected.

One way to ensure that all allocated resources and temp tables are cleaned up as soon as the connection is closed would be to deactivate connection pooling. However, we should always try to avoid this. The cleanest way to avoid all of the previously described issues is to always explicitly drop temporary objects.
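For instance, a method that works with a hypothetical #work table could clean up like this before its connection goes back into the pool:
using (var cn = new SqlConnection(cnStr)) {
   cn.Open();
   using (var cmd = new SqlCommand("CREATE TABLE #work (Id INT NOT NULL);", cn)) {
      cmd.ExecuteNonQuery();
   }

   // ... fill and use #work ...

   // explicitly drop the temp table so that neither the object nor its
   // data survive in tempdb until the pooled connection is reused
   using (var cmd = new SqlCommand(
         "IF (OBJECT_ID('tempdb..#work') IS NOT NULL) DROP TABLE #work;", cn)) {
      cmd.ExecuteNonQuery();
   }
}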

Pooled Connection became Invalid

If a connection becomes invalid while it is in the connection pool, it will still be returned for the next usage. This could be caused by network problems, a restart of the SQL Server service, a manual KILL of the connection on database side or several other reasons.

Here is a little example that can be debugged to show this:
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;";
         
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand("SELECT @@SPID", cn)) {
   cn.Open();
   Debug.Print("SPID: {0}", cmd.ExecuteScalar());
}

// at this point, restart the service or KILL the connection on server side

using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand("SELECT @@SPID", cn)) {
   cn.Open();
   Debug.Print("SPID: {0}", cmd.ExecuteScalar());
}
When debugging this code and restarting the SQL Server service after the first connection has been disposed and before the second connection is opened, you will get an exception. In my case (since I'm working with a local instance of SQL Server):
A transport-level error has occurred when sending the request to the server. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.)
At the time of this post, there is no built-in support in ADO.NET for SQL Server that tells the resource pool to verify the connection before it is returned to the SqlConnection. However, if occasional network issues or other problems that cause invalid internal connections sometimes happen in your environment, you can implement your own connection validation in a factory method that creates your connections.

string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;";
SqlConnection cn = new SqlConnection(cnStr);
cn.Open();

try {
   using (SqlCommand cmd = new SqlCommand("DECLARE @i INT", cn)) {
      cmd.ExecuteNonQuery();
   }
}
catch (SqlException ex) {
   if (ex.Message.StartsWith("A transport-level error has occurred ")) {
      cn = new SqlConnection(cnStr);
      cn.Open();
   }
   else
      throw;
}
return cn;
As you see, the connection sends a tiny SQL statement to the server before it is returned to the caller. If a transport-level exception occurs, the broken internal connection is removed from the pool and another, new connection is created and returned. You can replace the "DECLARE @i INT" SQL statement with an empty stored procedure called usp_VerifyConnection. This might be slightly faster (not tested, though), but will definitely look less confusing in SQL Server traces ;-).

Control the Connection Pool over Connection String

The SqlConnection.ConnectionString provides several attributes that allow controlling the behavior of ADO.NET pooling.

Connection Lifetime can be used to specify a maximum lifetime for a connection, in seconds. When a connection is returned to the pool, this value is compared to the (client side) creation time of the connection. If the connection's lifetime has elapsed, it will be closed and not put into the pool. Attention: this implies that if a connection is sent back into the pool before its lifetime has elapsed and is then not used for a long time, it will still stay open and be reused one more time!
// connection string with a lifetime of 10 seconds
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;connection lifetime=10";
string sql = "SELECT connect_time FROM sys.dm_exec_connections WHERE session_id = @@SPID";

// create a new connection with 10 seconds lifetime and send it back to the pool
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}

// wait 15 seconds
Thread.Sleep(15000);
// acquire a new connection and release it
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}

// create another connection
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}
And here is the output of this test:
Connect time: 12:47:07.617
Connect time: 12:47:07.617
Connect time: 12:47:22.837
As we see, even though the specified connection lifetime has elapsed, the second connection still gets the pooled connection. As long as the expired connection is not sent back to the pool it stays active. Only the third connection retrieves a newly created connection, since the pooled connection's lifetime had already elapsed when it was sent back to the pool.

The default value for Connection Lifetime is zero, which means an unlimited lifetime.

Min Pool Size and Max Pool Size can be used to specify a minimum/maximum count of pooled connections. The default value for Min Pool Size is zero, the default value of Max Pool Size is 100. When trying to use more concurrent connections than the specified Max Pool Size allows, you will get an exception after the connect timeout has elapsed.
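A little sketch that provokes this exception with a pool capped at two connections (the connect timeout is shortened so the test does not wait the default 15 seconds):
// max pool size=2: the third concurrent Open() waits for the connect
// timeout and then throws an InvalidOperationException
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;"
             + "max pool size=2;connect timeout=5";

var cn1 = new SqlConnection(cnStr);
var cn2 = new SqlConnection(cnStr);
var cn3 = new SqlConnection(cnStr);
try {
   cn1.Open();
   cn2.Open();
   cn3.Open(); // pool exhausted -> exception after 5 seconds
}
catch (InvalidOperationException ex) {
   Console.WriteLine(ex.Message);
}
finally {
   cn1.Dispose();
   cn2.Dispose();
   cn3.Dispose();
}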

The Pooling attribute (default 'true') specifies whether connection pooling shall be used for the created connection. If you use a SqlConnection with pooling 'false', the database connection will always be newly created and closed after use.
// connection string with "pooling=false"
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;pooling=false";
string sql = "SELECT connect_time FROM sys.dm_exec_connections WHERE session_id = @@SPID";

// create a new connection and print its server side connect time
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}

// create a new connection and print its server side connect time
using (SqlConnection cn = new SqlConnection(cnStr))
using (SqlCommand cmd = new SqlCommand(sql, cn)) {
   cn.Open();
   DateTime connectTime = (DateTime)cmd.ExecuteScalar();
   Debug.Print("Connect time: {0}", connectTime.ToString("HH:mm:ss.fff"));
}

And here is the output:
Connect time: 13:41:32.790
Connect time: 13:41:33.050

As we see, the connect time on server side is different for both connections, whereas it would be equal if the same internal connection had been reused for the second SqlConnection.

The Enlist attribute (default 'true') is more special than the other pooling related connection string attributes. If you don't work with distributed systems and/or the System.Transactions namespace, you won't need it. The Enlist attribute is important if you need to control how transactions span more than one transactional sub-system (like database servers, WCF services, ...). If Enlist is set to 'true' and a parent transaction context is available through the static property System.Transactions.Transaction.Current, ADO.NET will automatically register the SQL connection and its transaction in this transaction context. If you commit your database transaction (implicitly or explicitly), the transaction will stay pending until the parent transaction is committed or rolled back. Today, the most common way to manage those parent transactions is the TransactionScope class.

A hypothetical system that might use transaction scopes is an order management system that automatically creates a shipping order by utilizing a supplier's web service when an order is created or approved. Here we need to update/create data in a local database and create a shipping order over a web service within one transaction. If both systems support (and allow) transactions, a transaction scope can be used to ensure that either all data are written or none.

If Enlist is set to 'false', a new connection will not be enlisted in an existing transaction scope and behaves as if no transaction scope were available. All database transactions are committed, independent of whether the parent transaction fails or succeeds.
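A small sketch, inserting into a hypothetical TestBlob table; note the enlist=false attribute in the connection string (TransactionScope lives in System.Transactions):
// with enlist=false the connection ignores the ambient transaction scope
string cnStr = "server=(local);database=Sandbox;trusted_connection=sspi;enlist=false";

using (var scope = new TransactionScope())
using (var cn = new SqlConnection(cnStr))
using (var cmd = new SqlCommand("INSERT INTO TestBlob VALUES (3, 0x)", cn)) {
   cn.Open();
   cmd.ExecuteNonQuery(); // committed immediately
}  // scope disposed without Complete() - enlisted work would have been rolled back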

Control the Connection Pool over static SqlConnection methods

In addition to the control mechanisms provided by the connection string, the SqlConnection class provides two static methods to partially or completely clear the current pools.

SqlConnection.ClearPool(SqlConnection) clears all pooled connections that match the provided connection - by its connection string.

SqlConnection.ClearAllPools clears all currently pooled SQL Server connections.

Pool Fragmentation

There are two more possible issues with ADO.NET connection pooling. I find both are well described in SQL Server Connection Pooling (ADO.NET), so I just quote the paragraphs for the sake of completeness.

Pool Fragmentation Due to Integrated Security (Windows Authentication)
Connections are pooled according to the connection string plus the user identity. Therefore, if you use Basic authentication or Windows Authentication on the Web site and an integrated security login, you get one pool per user. Although this improves the performance of subsequent database requests for a single user, that user cannot take advantage of connections made by other users. It also results in at least one connection per user to the database server. This is a side effect of a particular Web application architecture that developers must weigh against security and auditing requirements.
Pool Fragmentation Due to Many Databases
Many Internet service providers host several Web sites on a single server. They may use a single database to confirm a Forms authentication login and then open a connection to a specific database for that user or group of users. The connection to the authentication database is pooled and used by everyone. However, there is a separate pool of connections to each database, which increase the number of connections to the server.

This is also a side-effect of the application design. There is a relatively simple way to avoid this side effect without compromising security when you connect to SQL Server. Instead of connecting to a separate database for each user or group, connect to the same database on the server and then execute the Transact-SQL USE statement to change to the desired database. The following code fragment demonstrates creating an initial connection to the master database and then switching to the desired database specified in the databaseName string variable.

// Assumes that command is a SqlCommand object and that
// connectionString connects to master.
command.Text = "USE DatabaseName";
using (SqlConnection connection = new SqlConnection(connectionString)) {
   connection.Open();
   command.ExecuteNonQuery();
}
Two short remarks on the second quote: I would not restrict this possible issue to web applications. This kind of pool fragmentation can also happen on application servers that work with many databases on the same database server. Also, I don't know why MSDN suggests using an explicit SqlCommand to change the database instead of using the SqlConnection.ChangeDatabase instance method.
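Roughly, the ChangeDatabase alternative looks like this sketch (reusing the variables from the quoted fragment):
using (SqlConnection connection = new SqlConnection(connectionString)) {
   connection.Open();                       // pooled connection to master
   connection.ChangeDatabase(databaseName); // equivalent of "USE databaseName"
   // ... execute commands against the selected database ...
}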

Wednesday, August 10, 2011

Generic C# Resource Pool

I guess this becomes another almost code-only blog post.

Definition

A resource pool is a class that manages synchronized access by a (theoretically) unlimited number of callers to a restricted count of resource items.



One reason to use a resource pool could be an integrated sub-system that is restricted by its maximum count of concurrently accessing threads. This could be a web service with a restricted count of sessions. Another reason for a resource pool is a resource that becomes slow when it is accessed by too many concurrent connections. This could be a web service, a database or a farm of processing servers. A third reason for a resource pool could be a resource that takes long to initialize but can be reused. This could be database connections, XSLT transformation classes, custom XML serializers or many other classes.
Well known .NET framework examples of resource pools are the ADO.NET connection pool (see my previous post) and the System.Threading.ThreadPool class.

Implementation

A few months ago I wrote myself a generic resource pool class that helps me get those kinds of bottlenecks under control.

Here is my implementation of a generic, thread safe C# resource pool. All source code files, as well as some NUnit tests, are attached as downloadable files at the bottom of this post.

// ===================================================================
// PoolItem<T>
// ===================================================================

/// <summary>
/// Represents an item in the <see cref="ResourcePool{T}"/>
/// </summary>
/// <typeparam name="T">The type of the resource to be held</typeparam>
public sealed class PoolItem<T> : IDisposable where T : class {
   internal PoolItem(ResourcePool<T> pool, T resource) {
      _pool = pool;
      _resource = resource;
   }

   private T _resource;
   private readonly ResourcePool<T> _pool;

   public static implicit operator T(PoolItem<T> item) {
      return item.Resource;
   }

   /// <summary>
   /// Gets the resource held by this resource pool item
   /// </summary>
   public T Resource { get { return _resource; } }

   /// <summary>
   /// Disposes this instance of a resource pool item and sends the resource back
   /// to the pool
   /// </summary>
   public void Dispose() {
      _pool.SendBackToPool(_resource);
      _resource = null;
   }
}


// ===================================================================
// ResourcePool<T>
// ===================================================================

/// <summary>
/// Generic pool class
/// </summary>
/// <typeparam name="T">The type of items to be stored in the pool</typeparam>
public class ResourcePool<T> : IDisposable where T : class {
   /// <summary>
   /// Creates a new pool
   /// </summary>
   /// <param name="factory">The factory method to create new items to
   ///  be stored in the pool</param>
   public ResourcePool(Func<ResourcePool<T>, T> factory) {
      if (factory == null)
         throw new ArgumentNullException("factory");
      _factoryMethod = factory;
   }

   private readonly Func<ResourcePool<T>, T> _factoryMethod;
   private ConcurrentQueue<PoolItem<T>> _freeItems = 
      new ConcurrentQueue<PoolItem<T>>();
   private ConcurrentQueue<AutoResetEvent> _waitLocks = 
      new ConcurrentQueue<AutoResetEvent>();
   private ConcurrentDictionary<AutoResetEvent, PoolItem<T>> _syncContext =
      new ConcurrentDictionary<AutoResetEvent, PoolItem<T>>();

   public Action<T> CleanupPoolItem { get; set; }

   /// <summary>
   /// Gets the current count of items in the pool
   /// </summary>
   public int Count { get; private set; }

   public void Dispose() {
      lock (this) {
         if (Count != _freeItems.Count)
            throw new InvalidOperationException(
               "Cannot dispose the resource pool while one or more pooled "
               + "items are in use");

         foreach (var poolItem in _freeItems) {
            Action<T> cleanMethod = CleanupPoolItem;
            if (cleanMethod != null)
               cleanMethod(poolItem.Resource);
         }

         Count = 0;
         _freeItems = null;
         _waitLocks = null;
         _syncContext = null;
      }
   }

   /// <summary>
   /// Gets a free resource from the pool. If no free items available this method 
   /// tries to create a new item. If no new item could be created this method 
   /// waits until another thread frees one resource.
   /// </summary>
   /// <returns>A resource item</returns>
   public PoolItem<T> GetItem() {
      PoolItem<T> item;

      // try to get an item
      if (!TryGetItem(out item)) {
         AutoResetEvent waitLock = null;

         lock (this) {
            // try to get an entry in exclusive mode
            if (!TryGetItem(out item)) {
               // no item available, create a wait lock and enqueue it
               waitLock = new AutoResetEvent(false);
               _waitLocks.Enqueue(waitLock);
            }
         }

         if (waitLock != null) {
            // wait until a new item is available
            waitLock.WaitOne();
            _syncContext.TryRemove(waitLock, out item);
            waitLock.Dispose();
         }
      }

      return item;
   }

   private bool TryGetItem(out PoolItem<T> item) {
      // try to get an already pooled resource
      if (_freeItems.TryDequeue(out item))
         return true;

      lock (this) {
         // try to create a new resource
         T resource = _factoryMethod(this);
         if (resource == null && Count == 0)
            throw new InvalidOperationException("Pool empty and no item created");

         if (resource != null) {
            // a new resource was created and can be returned
            Count++;
            item = new PoolItem<T>(this, resource);
         }
         else {
            // no items available to return at the moment
            item = null;
         }

         return item != null;
      }
   }

   /// <summary>
   /// Called from <see cref="PoolItem{T}"/> to free previously taken resources
   /// </summary>
   /// <param name="resource">The resource to send back into the pool.</param>
   internal void SendBackToPool(T resource) {
      lock (this) {
         PoolItem<T> item = new PoolItem<T>(this, resource);
         AutoResetEvent waitLock;

         if (_waitLocks.TryDequeue(out waitLock)) {
            _syncContext.TryAdd(waitLock, item);
            waitLock.Set();
         }
         else {
            _freeItems.Enqueue(item);
         }
      }
   }
}

As you see, the ResourcePool<T> returns PoolItem<T> objects that hold a reference to the pooled resource. The pool does not need to be initialized with any resource items; new resource items are lazily created when (and if) needed. Therefore the pool requires a factory method, provided at construction, which is called whenever a new resource is requested and no free resource is currently available. If the factory method returns null, the requesting thread is suspended until another thread releases a used resource.

Usage

The pool can be used out of the box.
class MyResource {
   public void DoSomething() { }
}

[Test]
public void SampleUsage() {
   ResourcePool<MyResource> pool = new ResourcePool<MyResource>(CreateResource);

   using (var poolItem = pool.GetItem()) {
      MyResource resource = poolItem.Resource;
      resource.DoSomething();
   }
}

private static MyResource CreateResource(ResourcePool<MyResource> pool) {
   return pool.Count < 3 ? new MyResource() : null;
}

However, if it is used in many places in your system, I'd suggest wrapping it into a custom pool class. This pool can return a wrapper around the real resource that provides a more specific interface to the consumer. The wrapper holds an internal reference to a PoolItem<T> and implements IDisposable to release the resource back into the pool when it is not needed anymore.
// ==========================================================
// The real resource
class InternalResource {
   public void DoSomething() { }
}

// ==========================================================
// The external wrapper returned to the consumer
class MyResource : IDisposable {
   private PoolItem<InternalResource> _poolItem;

   public MyResource(PoolItem<InternalResource> poolItem) {
      _poolItem = poolItem;
   }

   public void DoSomething() {
      _poolItem.Resource.DoSomething();
   }

   public void Dispose() {
      _poolItem.Dispose();
      _poolItem = null;
   }
}

// ==========================================================
// The custom pool
class MyPool {
   private ResourcePool<InternalResource> _pool;

   public MyPool() {
      _pool = new ResourcePool<InternalResource>(CreateResource);
   }

   public MyResource GetItem() {
      return new MyResource(_pool.GetItem());
   }

   private InternalResource CreateResource(ResourcePool<InternalResource> pool) {
      return pool.Count < 3 ? new InternalResource() : null;
   }
}

// ==========================================================
// Sample usage
[Test]
public void SampleUsage() {
   MyPool pool = new MyPool();

   using (MyResource resource = pool.GetItem()) {
      resource.DoSomething();
   }
}

Here are the files for download:

I hope the pool can help you solve a few of your issues, as it did for me.

Wednesday, August 3, 2011

Object reference not set to an instance of an object

We all know, and love, this meaningful message from somewhere inside of a system.
Luckily I can say that in my last projects I was able to almost eliminate those failures - without 100% unit test code coverage. (Those out there who say 100% code coverage by unit tests is a must: please don't flame me. I'm a big fan of well tested software, but I guess I'm not yet ready for 100% coverage ;-).)

My solution for this old enemy was much simpler than some might think. It was done by a simple coding convention. Don't get me wrong, I don't like hundreds of rules about how to write code. In my current project we've got less than 10 project internal coding conventions. Developing large software systems is a kind of creative work, and I think too many rules can kill the creativity of developers. I say every experienced developer should be able to write code that others can read and should be able to read the code of others.

Our new rule to get rid of null reference exceptions was simply:

Get-methods never return null. They return a valid value or throw an exception.

This does not mean that our system throws thousands of exceptions per minute. It is still possible to unsuccessfully try to get data from a method without getting an exception; we just do it differently.

In old projects, code like this was a main part of the daily business:
Customer customer = dataSource.GetCustomerById(1234);
if (customer != null) {
   customer.DoSomething();
}

This is nice code and works perfectly. Though, what if someone (me?) ever forgets to check whether the method returned null? In this case it becomes only a question of time until we see our old friend in a log file or - even worse - in a dialog shown to the user.
To avoid try-catch blocks over and over, and to avoid NullReferenceExceptions, we simply established the TryGet pattern used by .NET classes like Dictionary<TKey,TValue>. Instead of a Get-method that possibly returns null, we use a TryGet-method that returns a bool value indicating whether our try to get something was successful.
public bool TryGetCustomerById(int id, out Customer customer) {
   using (var connection = CreateOpenConnection())
   using (var command = CreateProcCommand(connection, "GetCustomerById")) {
      command.Parameters.Add("@id", SqlDbType.Int).Value = id;

      using (var reader = command.ExecuteReader(CommandBehavior.SingleRow)) {
         if (reader.Read()) {
            customer = new Customer();
            customer.Id = id;
            customer.Name = (string)reader["Name"];
         }
         else {
            customer = null;
         }
      }
   }
   return customer != null;
}

Plus, if really needed, a Get-method that utilizes the TryGet-method and throws an exception if the search was unsuccessful.
public Customer GetCustomerById(int id) {
   Customer customer;
   if (!TryGetCustomerById(id, out customer))
      throw new EntityNotFoundException(
         string.Format("Cannot find customer with id {0}", id));
   return customer;
}

However, in many cases we don't need the Get-methods any more, and in a few other cases we don't need the TryGet-methods, since something must be wrong if the requested information is not available and a (meaningful) exception is a must.

This simple change to a TryGet-method turns our consuming code into something like this:
Customer customer;
if (dataSource.TryGetCustomerById(123, out customer)) {
   customer.DoSomething();
}
Some might argue "we still need the check for a result", and they are correct. But this is a psychological thing. Calling a method named Try... clearly tells the consumer that this call can return without a result. Even more: sometimes, when it is okay if the method does not return a value, a developer might leave out the if-condition at the point where (s)he calls a TryGet-method. In those cases I saw other developers asking "But you call a Try-method and you don't check the result, are you sure??".

In cases where a method can return more than one item, like GetCustomersForCountry, we say it is valid to return an empty list. This is for two reasons: an empty list is not null, and most of the time those calls are followed by a loop that will simply not be entered for an empty list. However, if we call a method that returns a list and we only need the first item, we should consider adding an additional method that returns only one item (like GetFirstCustomerForCountry). Another solution is an extension method IEnumerable<T>.TryGetFirst(out T item), sketched below, and we are back in our style.
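Such an extension method is not part of the .NET framework, so we would write it ourselves; a small sketch:
public static class EnumerableExtensions {
   /// <summary>
   /// Returns true and the first item of the sequence, if there is one.
   /// </summary>
   public static bool TryGetFirst<T>(this IEnumerable<T> source, out T item) {
      foreach (T current in source) {
         item = current;
         return true;
      }
      item = default(T);
      return false;
   }
}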

For sure, the samples I used here, getting data from some kind of data source, cover only one of the many places in a system where we find Get-methods. We use this pattern for all methods we write, independent of whether they work with a data source or any other objects. And we are doing well with it.

As I said, every now and then we still see a NullReferenceException, but I guess there have been about 2 or 3 over the last 5 months of intensive development and testing. All in development or test environments, none in production. Plus, if this happens, we know there are only two options how to fix it: either add the missing if-clause around the call of a TryGet-method (didn't happen over the last year) or change the null returning Get-method.

Probably I'm the only one who ever had problems with null reference exceptions but, for some reason, it feels like I'm not ;-). I can say we are truly at a point where we can call this old problem (almost) gone.