An Efficient Memory Stream, September 29, 2009

Loring Software, Inc

A Software Developer's Notebook

Some things that I do as a programmer have stuck with me despite all the changes in programming over the years. Take memory allocation. Back in 1987, despite only one program running at a time on a PC, you had to make everything fit in 512K. Now you have almost unlimited memory available to you as a programmer. You can't even take a photo smaller than 512K anymore. But when I work with memory, I can't help but think about the blocks of bytes moving around, and how I should operate on them in the most efficient way. 20 years ago, you had no choice. Your program crashed when you ate up all the memory. But now you can be completely ignorant of memory allocation as a programmer, and, except in special circumstances, get away with it.

A recent project I was working on is a good example. I am calling a web service that fetches a zip file, and then extracting the zip to a directory. The web service pulls the zip from a web site, so it retrieves it as a series of byte[] blocks. The library I am using to extract the files (DotNetZip) expects a single byte[] or a stream of some sort. But here I am with a byte[][].

The easy way to handle the blocks of bytes is to retrieve them all, then allocate one big buffer to hold them, and copy each block into it. The 1987 programmer in me just couldn't stomach that. So how am I to get a byte[][] to the library? I could write the blocks to a file and load them back in, now that I know the file size. But that seems a waste of time. Plus, I don't want my web service to need write access to the disk.
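For comparison, the "easy way" described above might look roughly like the sketch below. The class and method names (BlockJoiner, Concatenate) are made up for illustration and do not come from the project. The point is only that the whole file ends up in memory twice while the copy is in progress.

```csharp
using System;

public static class BlockJoiner
{
    // Hypothetical helper (not from the original project): copies every block
    // into one newly allocated array. Assumes no block is null.
    public static byte[] Concatenate(byte[][] blocks)
    {
        if (blocks == null) throw new ArgumentNullException("blocks");

        // First pass: add up the block sizes so one buffer can hold them all.
        long total = 0;
        foreach (byte[] block in blocks)
            total += block.Length;

        // This allocation is the second full copy of the file in memory.
        byte[] result = new byte[total];

        // Second pass: copy each block to its place in the big buffer.
        int written = 0;
        foreach (byte[] block in blocks)
        {
            Buffer.BlockCopy(block, 0, result, written, block.Length);
            written += block.Length;
        }
        return result;
    }
}
```

The result could then be wrapped in an ordinary MemoryStream, at the cost of that extra allocation and copy.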

The solution I used was to create a memory stream that could work with a byte[][], instead of the standard byte[]. This way, my bytes stay put, never having to be reallocated. I started by copying the definition of the MemoryStream, but replaced the constructor to take an array of byte[].

    public class MemoryStreamArray : MemoryStream
    {
        private long m_position = 0;
        private int m_bufferNumber = 0;
        private byte[][] m_buffer;
        private int m_capacity;
        private bool m_isDisposed = false;

        public MemoryStreamArray(byte[][] buffer)
        {
            if (buffer == null) throw new ArgumentNullException("buffer", "buffer cannot be null");
            m_buffer = buffer;
            m_capacity = 0;
            foreach (byte[] a_bytes in m_buffer)
                m_capacity += a_bytes.Length;
        }

The key to letting the Zip library operate on the MemoryStreamArray is to properly read and seek within the buffer:

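        public override int Read(byte[] buffer, int offset, int count)
        {
            if (buffer == null) throw new ArgumentNullException("buffer");
            if (offset < 0) throw new ArgumentOutOfRangeException("offset");
            if (count < 0) throw new ArgumentOutOfRangeException("count");
            if (buffer.Length - offset < count) throw new ArgumentException("offset and count describe a range that extends past the end of buffer");

            // Find the block that contains m_position, skipping any empty blocks.
            int a_bufferNumber = 0;
            long a_thisBufferStart = 0;
            while (a_bufferNumber < m_buffer.Length &&
                   m_position >= a_thisBufferStart + m_buffer[a_bufferNumber].Length)
            {
                a_thisBufferStart += m_buffer[a_bufferNumber].Length;
                a_bufferNumber++;
            }

            int a_count = 0;
            while (a_count < count && a_bufferNumber < m_buffer.Length)
            {
                byte[] a_block = m_buffer[a_bufferNumber];
                int a_blockOffset = (int)(m_position - a_thisBufferStart);
                int a_toCopy = Math.Min(count - a_count, a_block.Length - a_blockOffset);
                // Copy a whole run at once instead of one byte per iteration.
                Buffer.BlockCopy(a_block, a_blockOffset, buffer, offset + a_count, a_toCopy);
                a_count += a_toCopy;
                m_position += a_toCopy;
                if (a_blockOffset + a_toCopy == a_block.Length)
                {
                    // This block is used up; continue with the next one.
                    a_thisBufferStart += a_block.Length;
                    a_bufferNumber++;
                }
            }

            // Keep the cached block index inside the array, even at the end of the stream.
            m_bufferNumber = Math.Min(a_bufferNumber, Math.Max(0, m_buffer.Length - 1));
            return a_count;
        }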

        public override long Seek(long offset, SeekOrigin loc)
        {
            if (offset > Int32.MaxValue)
                throw new ArgumentOutOfRangeException("offset", "offset is greater than System.Int32.MaxValue");
            if (m_isDisposed)
                throw new ObjectDisposedException(GetType().Name, "The current stream instance is closed");

            long a_position = m_position;
            switch (loc)
            {
                case SeekOrigin.Begin: a_position = offset; break;
                case SeekOrigin.Current: a_position += offset; break;
                // An offset from the end is relative to the total length, not the current position.
                case SeekOrigin.End: a_position = m_capacity + offset; break;
                default: throw new ArgumentException("invalid System.IO.SeekOrigin", "loc");
            }

            if (a_position < 0)
                throw new IOException("Seeking is attempted before the beginning of the stream");

            return (Position = a_position);
        }

The Position property keeps track of the current buffer number whenever the position changes:

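        public override long Position
        {
            get { return m_position; }
            set
            {
                if (value < 0) throw new ArgumentOutOfRangeException("value");
                m_position = value;
                // Walk forward to the block that contains the new position,
                // stopping at the last block if the position is at or past the end.
                long a_thisBufferStart = 0;
                m_bufferNumber = 0;
                while (m_bufferNumber < m_buffer.Length - 1 &&
                       a_thisBufferStart + m_buffer[m_bufferNumber].Length <= m_position)
                {
                    a_thisBufferStart += m_buffer[m_bufferNumber].Length;
                    m_bufferNumber++;
                }
            }
        }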
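To make the mapping from a stream position to a block concrete, here is a small standalone sketch. BlockLocator and Locate are hypothetical names used only for this illustration, and the helper assumes a non-negative position and no null blocks.

```csharp
using System;

public static class BlockLocator
{
    // Hypothetical helper: returns the index of the block that contains
    // position and writes the offset inside that block to offsetInBlock.
    // Returns blocks.Length when position is at or past the end of the data.
    // Assumes position is not negative.
    public static int Locate(byte[][] blocks, long position, out int offsetInBlock)
    {
        long start = 0;   // stream position of the first byte of block i
        for (int i = 0; i < blocks.Length; i++)
        {
            if (position < start + blocks[i].Length)
            {
                offsetInBlock = (int)(position - start);
                return i;
            }
            start += blocks[i].Length;
        }
        offsetInBlock = 0;
        return blocks.Length;
    }
}
```

With two blocks of 4 and 3 bytes, position 5 lands in block 1 at offset 1, and position 4, which sits exactly on the boundary, lands in block 1 at offset 0.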

The rest of the functions are pretty obvious. Since I only want to read the buffer, I can throw a lot of exceptions in functions like Write() and GetBuffer().
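As a rough idea of what those remaining members might look like, here is a sketch in a separate, made-up class (ReadOnlyMemoryStreamSketch). It is not the code from the project, and the exception types and messages are my assumption.

```csharp
using System;
using System.IO;

// Hypothetical class showing only the read-only members. In the real class
// these overrides would sit next to Read, Seek and Position.
public class ReadOnlyMemoryStreamSketch : MemoryStream
{
    public override bool CanRead { get { return true; } }
    public override bool CanSeek { get { return true; } }
    public override bool CanWrite { get { return false; } }

    public override void Write(byte[] buffer, int offset, int count)
    {
        throw new NotSupportedException("This stream is read-only");
    }

    public override void WriteByte(byte value)
    {
        throw new NotSupportedException("This stream is read-only");
    }

    public override void SetLength(long value)
    {
        throw new NotSupportedException("This stream is read-only");
    }

    public override byte[] GetBuffer()
    {
        // The data lives in many blocks, so there is no single array to hand out.
        throw new NotSupportedException("There is no single contiguous buffer to return");
    }
}
```

Because the class derives from MemoryStream rather than Stream, members such as Length, ReadByte and ToArray would also need overrides. The inherited versions work on the base class's own internal buffer, which is empty here.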

While this runs pretty fast, and is mostly "reallocation" free, there is still opportunity for enhancement. For example, the web service retrieves the original zip file in chunks, but allocates the whole file before returning it as a byte[][]. We could serve it back to the calling program in chunks as well. But that introduces some interesting issues. Considering the maintenance, I chose to save that for another day.