
In the quest for a finer-grained understanding of DNA and other biological sequence information, bioinformatics and computational biology push existing information technology to its limits. While there are many software tools for DNA analysis, BLAST (Basic Local Alignment Search Tool) is one of the most common. BLAST was created and is maintained by the U.S. National Center for Biotechnology Information (NCBI). As a distributed application, BLAST runs across hundreds or thousands of CPUs simultaneously. At the same time, its comparative algorithms demand large numbers of small, random I/O operations that can challenge conventional storage systems. This pattern also occurs across many other bioinformatics applications.
With the availability of large compute grids, adequate horsepower has been harnessed to tackle the compute needs of bioinformatics applications. However, the disk storage infrastructure has not kept up, resulting in:
- A need to replicate memory across individual clients for performance
- Excessive administration and tuning just to maintain minimum uptime thresholds
- Scalability limitations as applications require massive amounts of data and I/O throughput far exceeding the capability of a single storage server
- A forced addition of capacity to compensate for limited throughput (IOPS)
- Missed project schedules and potential for lost revenue
To compensate for this I/O bottleneck, application deployments often rely on techniques that are no longer fully effective, for the following reasons:
- Client workloads are not 100% evenly distributed, resulting in over-provisioning of memory resources on each client
- Molecular sequence database sizes now exceed the memory and cache capacity of individual clients or servers
Gear6 addresses this widening I/O performance gap between compute and storage resources with centralized storage caching, an innovative approach that complements existing storage infrastructure with high-performance caching appliances. The many benefits of centralized storage caching include:
- Blazing I/O response times – data served from cache in microseconds compared to milliseconds from disk
- Scalable cache architecture that reaches terabytes of cache capacity
- Appliance design fits seamlessly into file-based storage infrastructures without requiring a change of existing storage servers or clients
- A caching solution that complements scalable compute farms by serving data in parallel to hundreds or thousands of storage clients over standard Gigabit Ethernet
- Read-only workloads from bioinformatics applications are easily accelerated through caching
- Centralized caching complements your existing primary disk-based storage infrastructure by lowering maintenance costs and increasing overall uptime of the storage and computing system
- Gear6 solutions make use of intelligent caching to minimize or eliminate reconfiguration. Once the appliance is installed, no setup or teardown is required to shift instantly between processing jobs
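
The response-time benefit listed above (microseconds from cache compared to milliseconds from disk) depends on how many reads the cache actually serves. The short Python sketch below works through that arithmetic as a weighted average. The function name `effective_latency_us` and the latency figures (100 microseconds for a cache hit, 5 milliseconds for a disk read) are assumptions chosen purely for illustration, not measured Gear6 or BLAST numbers.

```python
def effective_latency_us(hit_ratio, cache_latency_us, disk_latency_us):
    """Average read latency when a fraction `hit_ratio` of reads is served from cache."""
    if not 0.0 <= hit_ratio <= 1.0:
        raise ValueError("hit_ratio must be between 0 and 1")
    # Weighted average: hits pay the cache latency, misses pay the disk latency.
    return hit_ratio * cache_latency_us + (1.0 - hit_ratio) * disk_latency_us


# Hypothetical figures for illustration only (not vendor measurements).
CACHE_US = 100.0    # assumed cache hit latency, in microseconds
DISK_US = 5000.0    # assumed disk read latency (5 ms), in microseconds

if __name__ == "__main__":
    for hit_ratio in (0.0, 0.5, 0.9, 0.99):
        avg = effective_latency_us(hit_ratio, CACHE_US, DISK_US)
        print(f"hit ratio {hit_ratio:.2f}: average read latency {avg:.0f} us")
```

Under these assumed figures, a 90% hit ratio brings the average read from 5,000 microseconds down to roughly 590 microseconds, which suggests why read-mostly workloads whose working set fits in cache stand to gain the most.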