A Cinnober white paper on: Latency
October 2009 update

© Copyright 2009 Cinnober Financial Technology AB. All rights reserved. Cinnober Financial Technology AB reserves the right to make changes to the information contained herein without prior notice. No part of this document may be reproduced, copied, published, transmitted, or sold in any form or by any means without the expressed written permission of Cinnober Financial Technology AB. Cinnober® and TRADExpress™ are trademarks or registered trademarks of Cinnober Financial Technology AB in Sweden and other countries. Other product or company names mentioned herein may be the trademarks of their respective owners.

Latency revisited

Today, speed is crucial to any marketplace that wants to stay competitive. At the same time, with high-frequency trading gaining an increasing share of overall volumes, the ability to manage constantly rising transaction volumes is also a necessity.

In 2007, Cinnober published a white paper which established some best practices for measuring latency in financial markets and publishing the results in a clear and understandable fashion. Since that paper was published, latency has become the most widely used performance metric. We also disclosed benchmark figures with a level of transparency seen neither before nor since, showing that the context of the latency testing environment is of the utmost importance.

In this paper we continue to explore the measurement of latency and, more importantly, what can be done to minimize it. We detail our test configurations and show how these affect the trade-off between latency and throughput. In our latest published benchmarks on a full-blown TRADExpress Trading System, we achieved a door-to-door latency of 286 microseconds and a business logic latency of 138 microseconds. We also publish our roadmap for further latency reductions, the goal of which is to go below 80 microseconds door-to-door within a year and below 25 microseconds within 18 months.

Recap

In our previous white paper, we outlined the factors affecting latency:

- Measuring methodology: where are the start and finish lines?
  - End-to-end: measured at the trading venue client
  - Door-to-door: measured at the gateway to the trading venue
  - Business logic only: measured at the matching engine
- Business model: how complex is each transaction?
- Transaction model: is the transaction safeguarded or not?
- Hardware and network architecture

End user perspective

For the trading venue client, end-to-end latency is naturally the most relevant to assess. However, door-to-door measurements are the only ones which can be used in unbiased comparisons of trading platforms, and they are thus the measurements we use. Co-located clients have end-to-end latencies very similar to the door-to-door latency.

We must also point out that performance figures will naturally improve with time, given that "state of the art" network and computer technologies continue to improve. This offers both opportunities and challenges. With a flexible solution deployable on almost any platform, Cinnober's customers are well positioned to immediately exploit any new server or network technology. The hard figures we publish are measured using current technology, not hypothetical extrapolations.
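The point about publishing results in a clear and understandable fashion is worth making concrete: a single average hides the spread of the latency distribution, so published figures are most useful when they include percentiles as well. The sketch below is a minimal illustration of such reporting; the class and method names are not taken from TRADExpress, and the sample values are placeholders.

    import java.util.Arrays;

    // Minimal illustration (not TRADExpress code): aggregate raw latency samples,
    // recorded in microseconds, into the kind of summary normally published.
    public final class LatencyReport {

        public static void main(String[] args) {
            // Placeholder door-to-door samples in microseconds.
            long[] samplesUs = {275, 281, 286, 290, 302, 340, 280, 288, 295, 310};
            System.out.printf("mean   : %.1f us%n", mean(samplesUs));
            System.out.printf("median : %d us%n", percentile(samplesUs, 50));
            System.out.printf("99th   : %d us%n", percentile(samplesUs, 99));
            System.out.printf("worst  : %d us%n", percentile(samplesUs, 100));
        }

        static double mean(long[] samples) {
            long sum = 0;
            for (long s : samples) sum += s;
            return (double) sum / samples.length;
        }

        // Nearest-rank percentile: sort a copy and pick the sample at the given rank.
        static long percentile(long[] samples, int p) {
            long[] sorted = samples.clone();
            Arrays.sort(sorted);
            int rank = (int) Math.ceil(p / 100.0 * sorted.length);
            return sorted[Math.max(0, rank - 1)];
        }
    }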
Architecture for performance

All performance increases come at a cost, whether purely financial (buying better hardware or engineering) or in the form of sacrificed features. By investigating the factors affecting latency we can establish limits on the best possible performance and, more importantly, what it will cost.

Functional blocks of a marketplace

For the scope of this paper, it is necessary to establish a terminology framework for the basic components of a marketplace system:

Name                       Abbr.    Purpose/Comment
Trading venue client       Client   For benchmark purposes, replaced by a load generator.
Gateway or Access Point    GW/AP    The connection node between marketplace and client.
Routing layer              Rt       Used for load balancing or transaction routing.
Matching engine            ME       The core of the marketplace.

In addition to the above, a redundancy system may be used. This can be either hot or cold, and can consist of a complete system or selected components. It can be located at a disaster recovery site or co-located. All of the above components, except the redundancy system, must necessarily be present in any marketplace system. Additional components, such as back office processing, may also be needed but are beyond the scope of this paper.

Processing tiers

As a transaction request is sent from the client to the heart of the marketplace, the matching engine, it passes through various processing tiers:

[Figure: processing tiers. The client connects through access points (AP) to the routing layer (Rt) and on to the matching engines (ME), with redundancy systems alongside.]

To reduce system complexity, some or all of these functional blocks can be integrated in one physical server, either horizontally, vertically, or a combination thereof, including full virtualization.

[Figures: the same processing tiers integrated horizontally, vertically, and in a fully virtualized combination.]

In all cases, all components are still there, but the communication between them changes. Horizontal integration reduces latency directly by removing network hops between nodes. Vertical integration can only reduce latency indirectly, by reducing network complexity, but is more useful as a redundancy solution or for providing scalability if the system has spare capacity.

Measuring method

The difference between measuring latency door-to-door or end-to-end may seem purely technical, but it is driven by the same force that has given rise to co-location. By hosting itself within the same site as the marketplace, a co-located client gains a significant advantage over any client which is not.

[Figure: a remote client connecting to the trading venue through an access point, with the end-to-end, door-to-door and business logic measuring points marked.]

The latency incurred by transmitting outside the marketplace can be anything from tens of microseconds up to several seconds, depending on the technology used for connectivity. The vast spread of these figures is the main reason why directly comparing a specific client's performance figures for different marketplaces may be meaningless if the client uses different technologies to connect to them.

[Figure: a co-located client, for which the end-to-end and door-to-door measuring points nearly coincide.]

Since Cinnober has no control over the trading venue client's set-up, door-to-door measurements are the most relevant. In the TRADExpress Trading System, this means measuring the time which elapses between a request being received at the AP and the corresponding response arriving back at the same AP after being processed by the matching engine.
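As a minimal sketch of what such a measurement point can look like, the fragment below timestamps a request when it is read at the access point and again when the response is written back, yielding the door-to-door time. The class and method names are illustrative only and not taken from TRADExpress.

    // Illustrative sketch only (not TRADExpress code): door-to-door latency is the
    // time between a request arriving at the access point and the corresponding
    // response leaving the same access point.
    public final class DoorToDoorClock {

        /** Captured when the request is read from the client connection. */
        private final long receivedAtNanos;

        private DoorToDoorClock(long receivedAtNanos) {
            this.receivedAtNanos = receivedAtNanos;
        }

        /** Call as the first action when a request arrives at the access point. */
        public static DoorToDoorClock onRequestReceived() {
            return new DoorToDoorClock(System.nanoTime());
        }

        /** Call just before the response is written back to the client connection. */
        public long onResponseSentMicros() {
            return (System.nanoTime() - receivedAtNanos) / 1_000L;
        }

        // Tiny usage demonstration with a stubbed routing and matching step.
        public static void main(String[] args) throws InterruptedException {
            DoorToDoorClock clock = DoorToDoorClock.onRequestReceived();
            Thread.sleep(1); // stand-in for routing plus matching engine work
            System.out.println("door-to-door: " + clock.onResponseSentMicros() + " us");
        }
    }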
Business model

The business model of the trading venue can have an equally large impact on latency. Complex order types may require longer processing times. Furthermore, if the business model requires system-wide locks for certain transactions, those locks can potentially impact the performance of all other concurrent transactions.

For a trading venue, there is thus a choice between offering a complex model, which gives traders a rich set of trading products and functions, and a simpler model, which can offer faster execution. This is a trade-off which should be addressed by trading venue operators.

Transaction model

The transaction model is the technical architecture ensuring data processing quality:

1. By safeguarding the integrity of the data against loss, whether network, host or data center (disaster) related.
2. By providing a fast execution path.

Redundancy

The established method among marketplaces to ensure data integrity has been to commit recovery data to local servers, log files and remote standby servers before any response is returned. Recently, however, the use of synchronized standby servers at remote centers has been questioned, precisely because it impacts latency negatively.

[Figure: recovery data storage at the trading venue, with synchronous or asynchronous transmission of redundancy data to remote servers.]

Instead, some trading venues today deploy co-located standby servers and asynchronous remote emergency backup servers. An emergency fail-over in this set-up requires a roll-back of all transactions done at the primary site but not yet committed to the backup site, e.g. the busting of trades. In a real emergency, any financial costs incurred by being forced to bust these trades are probably negligible compared to the cost of the disaster itself.

Parallelization

Fast execution is ensured by providing a highly parallel and pipelined architecture. Note, however, that ever since Amdahl's Law (cf. e.g. http://en.wikipedia.org/wiki/Amdahl's_law) gained acceptance in the industry, it has been a well-known fact that scalability through parallelization has limits which depend on how much of the work actually can be executed in parallel. Cinnober's TRADExpress technology is highly parallelized and applies modern queueing theory for best performance.

Data safety evaluated as a financial risk

The cost versus latency gain of the chosen transaction model is hard to quantify, but the data safety parameter is probably the easiest to understand, since it can be formulated in commercial terms. Thus, if we forego some data safety, e.g. by using asynchronous standby servers, the probability of losing a transaction can be assigned a financial risk. It might make sound financial sense to move the cost of data safety in the transaction model to the cost of a financial risk in the business model, if the shortened latency offers increased value to marketplace customers.
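As a purely illustrative sketch of this reasoning, the fragment below expresses reduced data safety as an expected yearly cost. All figures are placeholders chosen for the example, not Cinnober or customer data.

    // Purely illustrative sketch: expressing reduced data safety as an expected
    // yearly cost. All numbers are placeholders, not Cinnober or customer figures.
    public final class DataSafetyRisk {

        public static void main(String[] args) {
            double failoversPerYear      = 0.1;    // assumed emergency fail-overs per year
            double transactionsAtRisk    = 50;     // assumed transactions not yet replicated at fail-over
            double costPerBustedTradeEur = 5_000;  // assumed commercial cost per busted trade

            double expectedYearlyCostEur =
                    failoversPerYear * transactionsAtRisk * costPerBustedTradeEur;

            // This expected cost can then be weighed against the commercial value of
            // the latency saved by not waiting for a synchronous remote commit.
            System.out.printf("Expected yearly cost of asynchronous replication: EUR %.0f%n",
                    expectedYearlyCostEur);
        }
    }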
Hardware and network architecture

Every CIO or CTO is well acquainted with the task of selecting hardware and network architectures. It is easy to evaluate the cost simply by requesting quotes, performance numbers are readily available, and in theory it should not be hard to find an optimal solution. In practice, though, it can be. A typical example is the choice between buying a small number of high-performance servers or a larger number of smaller servers, or between a low-latency but short-range network and a standard network. You can achieve high system performance with either solution, albeit with different characteristics, but you simply cannot avoid the fact that top-of-the-line performance costs. Maintaining high performance, i.e. staying on top by continually replacing and upgrading the equipment, costs even more. If you are looking for a sweet spot, the technology choice would probably be a conservative and relatively cost-effective alternative. A marketplace facing exceptional demands, on the other hand, may go for the most specialized and advanced technology available and not shirk the substantial investments necessary, especially not those needed to keep the system up to date.

The best possible performance

Latencies on different hardware

Latency is intimately tied to the network architecture and less so to the server. We can easily measure the response time of the network itself by bouncing packets between interconnected nodes, which gives the following rough estimates of the network latency between two nodes:

Technology                        Typical round-trip (µs)
Shared memory interconnect        < 1
"Raw packets" on InfiniBand       < 10
SDP on InfiniBand                 20
TCP over InfiniBand               50
TCP/IP on 10 Gbit/s Ethernet      50
TCP/IP on 1 Gbit/s Ethernet       100

The pure acquisition cost of Ethernet or InfiniBand is roughly the same. Ethernet supports longer distances, while InfiniBand is typically used for co-location. Note that the highest speed gains with InfiniBand can only be achieved through specialized low-level coding, which increases the complexity of the software.

Also note that the figures above represent times between two nodes only. If a transaction has to pass between more than two nodes, the latency is multiplied by the number of hops. If there is a business gateway of any sort, e.g. for order routing, between the client and the matching engine, the end-to-end latency of any system using 1 Gbit/s Ethernet is at least 200 µs, even assuming infinitely fast business logic in the gateway and the matching engine and a co-located client.

Middleware

It is important to understand that middleware is just software. It cannot by itself increase network speed; the middleware always relies on the underlying network hardware. It can offload some of the complexity otherwise required in the application software to maintain data integrity or simplify data routing, but it is not a magic wand that increases performance. The TRADExpress Trading System's architecture allows for fast and efficient integration with any commercially available middleware product. When evaluating performance, we have not yet found any that offers an advantage we cannot achieve in-house.

Evaluating performance claims

Given the figures above, any claim of system latencies on the order of 100 µs or substantially below must mean that the figures were measured (i) internally on a single supercomputer, (ii) using non-standard network technologies, (iii) between two nodes only, (iv) without synchronous safe-storing to a standby, (v) only within the matching engine, i.e. as business logic latency, or similar. As seen previously, network delays set hard limits on how low end-to-end latency can go. Using 1 Gbit/s Ethernet, latency cannot be lower than 100 µs assuming an extremely simple system model without routing, or lower than 200 µs with routing. Using the fastest network technology commercially available today and specially coded software, it cannot be lower than 10-20 µs. Both examples assume infinitely fast business logic.
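The arithmetic behind these limits is simple enough to spell out. The sketch below multiplies the per-hop round-trip figures from the table above by the number of hops, assuming infinitely fast business logic; it is an illustration of the reasoning, not a measurement tool, and the names are our own.

    // Sketch of the lower-bound arithmetic: with zero business logic time, the best
    // possible latency is the per-hop round trip multiplied by the number of
    // node-to-node hops the transaction must traverse.
    public final class NetworkFloor {

        static long floorMicros(long roundTripPerHopUs, int hops) {
            return roundTripPerHopUs * hops;
        }

        public static void main(String[] args) {
            // Per-hop round-trip figures from the table above.
            long gigabitEthernetUs = 100;  // TCP/IP on 1 Gbit/s Ethernet
            long tunedInfinibandUs = 10;   // "raw packets" on InfiniBand, specially coded

            // Client <-> matching engine directly: one round trip.
            System.out.println("1 Gbit/s Ethernet, no routing : "
                    + floorMicros(gigabitEthernetUs, 1) + " us");
            // Client <-> gateway <-> matching engine: two round trips.
            System.out.println("1 Gbit/s Ethernet, with routing: "
                    + floorMicros(gigabitEthernetUs, 2) + " us");
            System.out.println("Tuned InfiniBand, no routing   : "
                    + floorMicros(tunedInfinibandUs, 1) + " us");
        }
    }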
Throughput

System throughput, measured as the aggregated capacity of the marketplace, is in practice infinitely scalable, since transactions in different instruments are independent; it is just a matter of adding more gateways and servers. Aggregated throughput figures are thus rather pointless, since any reasonably well designed system can reach any number. Nevertheless, large numbers are impressive and therefore popular. Single-instrument throughput would give much more interesting comparison data, but such numbers are sadly not in vogue.

On the other hand, for a given system the maximum throughput and the minimum latency are intimately tied together. Imagine a system test where the load is gradually increased while throughput and latency are measured continuously:

[Figure: throughput and latency plotted against load, up to and beyond the saturation load.]

As the system load increases, throughput will increase in a linear fashion and latency will stay almost constant until the system reaches saturation load. Throughput then stays at its peak and latency increases. If the load continues to increase much further, the system will eventually degrade.

Cinnober's latest benchmarks

To illustrate the performance discussion in this paper, we would like to disclose how we arrived at some of our benchmarks. In June 2009, Cinnober carried out performance tests of the TRADExpress Trading System on commodity servers, which achieved a door-to-door latency of 286 microseconds and a throughput capacity of over 800,000 quote updates per second. Performance can be increased at will by expanding the infrastructure. Our goal was to use standard equipment and to maximize performance on that given platform. Details of how the benchmarks were set up and run follow below.

Off-the-shelf components

A prerequisite of the test was to use a test environment consisting of standard commodity hardware and software:

- An HP c7000 blade enclosure, with a mix of:
  - BL685c-G5 AMD "Shanghai" blades (2,700 MHz), 4×4 cores, 5 units
  - BL460c-G6 Intel "Nehalem" blades (2,900 MHz), 2×4×2 cores (HT), 2 units
  - 1 Gbit/s Ethernet over an external HP ProCurve 2900 switch
  - InfiniBand, 4× DDR, capable of a 16 Gbit/s data rate
  - Linux (CentOS 5.2)
- A standard build of the TRADExpress Trading System, version 7.3.1

We configured the InfiniBand fabric to use IP over IB and the transparent TCP/SDP layer, which enabled us to utilize SDP without any code changes. Doing so, we achieved an inter-node latency of only 50 µs. We could have gone even lower by re-coding the communication layer of TRADExpress, but then we would no longer have been using standard components. Please note that the TRADExpress system used is a standard build implementing a normal business model sufficient to support a trading venue, complete with an order router layer, dark pool, peg orders, transparent order book, etc.

Latency test description

System configuration

In this test, the goal was to achieve low latency. To this end, we created a minimal three-node TRADExpress configuration with a single gateway (GW, an integration of the access point and the routing layer), a single matching engine (ME) and a load generator. We turned off data replication (no standby server) but still performed full asynchronous recovery logging. Published broadcasts were limited to the essential public information flows, such as distribution of market data, i.e. flows internal to the trading system were disabled.

[Figure: load generator, gateway and matching engine; door-to-door latency 286 µs, business logic latency 138 µs.]

The GW and ME ran on BL460c blades and the load generator on a BL685c; the interconnect was IP over InfiniBand. With 8 simulated users producing an aggregated load of 300 order entries per second, the average door-to-door latency was 286 µs, and the time spent on business logic, i.e. in the matching engine, was 138 µs.
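For readers who want to reproduce this kind of test, the sketch below shows one way a load generator can pace a number of simulated users at a fixed aggregate rate while collecting round-trip times. It is a generic illustration with placeholder names, not the load generator actually used in the benchmark.

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    // Generic illustration of a paced load generator (not the benchmark tool itself):
    // a number of simulated users each send orders at a fixed rate, and the
    // round-trip time of every request is accumulated for later reporting.
    public final class LoadGeneratorSketch {

        static final int USERS = 8;
        static final int ORDERS_PER_SECOND_PER_USER = 37;   // roughly 300 order entries/s in total

        public static void main(String[] args) throws Exception {
            ScheduledExecutorService pacer = Executors.newScheduledThreadPool(USERS);
            AtomicLong totalMicros = new AtomicLong();
            AtomicLong count = new AtomicLong();

            long periodMicros = 1_000_000L / ORDERS_PER_SECOND_PER_USER;
            for (int user = 0; user < USERS; user++) {
                pacer.scheduleAtFixedRate(() -> {
                    long start = System.nanoTime();
                    sendOrderEntry();                          // stubbed request/response
                    totalMicros.addAndGet((System.nanoTime() - start) / 1_000L);
                    count.incrementAndGet();
                }, 0, periodMicros, TimeUnit.MICROSECONDS);
            }

            Thread.sleep(10_000);                              // run for ten seconds
            pacer.shutdownNow();
            System.out.printf("avg round trip: %.1f us over %d requests%n",
                    (double) totalMicros.get() / count.get(), count.get());
        }

        /** Placeholder for a blocking order-entry request towards the gateway. */
        static void sendOrderEntry() {
            // In a real test this would write an order entry to the gateway connection
            // and block until the corresponding response has been received.
        }
    }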
Throughput test description

System configuration

In this test, the goal was to achieve high throughput. We settled on a configuration with four gateways (GW), five matching engines (ME) and a single load generator. Two nodes ran two GWs each, and five nodes ran one ME each.

[Figure: one load generator, four gateways and five matching engines; throughput 800,000 quotes/s.]

Since it is very common for traders to want to enter multiple quotes, TRADExpress offers a Quote Entry bundle transaction. We naturally elected to use this instead of single order entry, since it offers a fast lane to the parallel execution of the MEs. With 10 simulated users producing a load of about 2,500 transactions per second per GW, and each user transaction bundling 40 individual dual-leg Quote Entry transactions, the aggregated system transaction flow was 2,500 × 4 × 40 × 2 = 800,000 quotes/s. (The exact figure was slightly higher.) Doubling the number of servers would double the throughput, and so on.

Latency vs. throughput

Normally no vendor would publish the latency achieved during a throughput test, unless it was performed in a configuration with a very large number of high-performance servers running in parallel, since throughput always comes at the expense of latency. However, we would like to establish new standards of transparency. Even though the throughput test was designed to maximize throughput, we have no reason to hide the fact that latency was affected correspondingly. The latency in the throughput test was on average 4,300 µs per batch of 80 quotes, in the configuration described above.

Is this surprisingly high or not? Let us apply some simple queueing theory to these numbers. Little's Law (cf. e.g. http://en.wikipedia.org/wiki/Little's_law) says that the queue length Q of a system is the product of the transaction rate X and the response time R:

Q = X × R

In our case X equals 2,500 TPS × 4 (the aggregated flow through all four GWs) and R equals 4,300 µs:

Q = 2,500 × 4 × 0.0043 = 43

I.e. the average length of the queue of transactions waiting to be processed is 43. Furthermore, the response time R can be expressed as the product of the service time S (the time a request is actually being processed) and the queue length plus one:

R = S × (Q + 1)  =>  S = R / (Q + 1)

We can thus calculate the service time as S = 4,300 µs / 44 ≈ 98 µs, so the real processing time is actually quite short. Could we goad the system to even higher throughput without adding more hardware? Queueing theory again comes to our aid; the utilization U of a system can be expressed as:

U = X × S

In our case: U = 2,500 TPS × 4 × 98 µs ≈ 98%.
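The queueing arithmetic above is easy to reproduce. The short sketch below is our own illustration, using the figures from the throughput test, and recomputes the queue length, service time and utilization.

    // Recomputing the Little's Law figures from the throughput test.
    public final class LittlesLawCheck {

        public static void main(String[] args) {
            double ratePerSecond   = 2_500 * 4;   // X: aggregated flow through all four GWs
            double responseSeconds = 0.0043;      // R: average response time, 4,300 us

            double queueLength    = ratePerSecond * responseSeconds;         // Q = X * R
            double serviceSeconds = responseSeconds / (queueLength + 1);     // S = R / (Q + 1)
            double utilization    = ratePerSecond * serviceSeconds;          // U = X * S

            System.out.printf("Q = %.0f transactions waiting%n", queueLength);      // ~43
            System.out.printf("S = %.0f us service time%n", serviceSeconds * 1e6);  // ~98 us
            System.out.printf("U = %.0f %% utilization%n", utilization * 100);      // ~98 percent
        }
    }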
We are thus close to full saturation and cannot expect more unless we upgrade the system with more servers, which could be used either to lower latency at constant throughput or to increase throughput at the same latency.

In the end, latency and throughput come down to purely commercial decisions. What latency level is a specific trading venue aiming at? What is the expected transaction load today, in six months, and in a year? Or, in other words, how much hardware should you invest in? Each player must make that decision based on their specific situation.

Roadmap

Looking forward, there are a number of areas on which Cinnober is focusing to reduce latency further. The target is to reach a door-to-door latency below 80 microseconds within one year and below 25 microseconds within 18 months. The more notable focus areas are:

- One of the main features of the TRADExpress Trading System is its ability to run on commodity hardware while still yielding excellent performance. While TRADExpress will retain this important characteristic, greater emphasis will be put on tweaking and tuning the application to take advantage of high-end hardware and operating-system-specific features and characteristics. This will involve extensive benchmarking and tuning activities on a selected set of hardware platforms.
- General improvements to the business logic. While it is easy to achieve excellent latencies for simple order types and matching scenarios, trading venue clients do tend to use more advanced system features such as peg, strategy and stop-loss orders. Work is ongoing to optimize these algorithms for actual production usage patterns.
- Co-location of algo applications is a well-known concept for reducing latency. TRADExpress will take this concept to a new level, where algo applications are co-located within the TRADExpress Trading System itself. This is made possible by a unique "sand-box" concept and will, for all practical purposes, remove all network latency, giving algo applications hot-wire capability.

Conclusion

Our tests show what is possible today and that even a very modest investment can buy excellent performance. We have also identified some of the limits to performance and, more importantly, the cost of alleviating them.

- Throughput and latency are opposites, and a certain trade-off between them must always be accepted. Simultaneously high throughput and low latency can only be achieved through substantial investment in parallel servers.
- The traditional method for trading venues to ensure data integrity, i.e. synchronized remote standby systems for disaster recovery, has a negative impact on latency. Trading venue operators may consider accepting a simpler transaction model with asynchronous logging to a remote backup site, lowering latency at the cost of a financial risk in their business model.
- InfiniBand is currently faster than Ethernet, but so far requires significant coding efforts to be utilized at full capacity. It can also only be used over shorter distances, e.g. within a single site.
- Hard numbers should always be backed by hard facts, such as the hardware configuration and the measuring method.

As issues of latency and throughput are central to the assessment of most trading venues, we strongly believe that performance figures should be supported by a level of detail that clearly makes them useful as benchmarks.
We are of the opinion that there is still a lot to be done in this area. Outstanding performance has always been, and will always be, at the very heart of Cinnober's solutions, whether in processing orders, quotes and trades, performing real-time risk calculations, or distributing market data. Further optimizing performance and utilizing new technologies as they arise is an ongoing process. At Cinnober we will continue to take an open and transparent approach to these tasks, not only with regard to the measurement of latency but also to the publication of supportable facts regarding system performance.

Passion for change

Cinnober is the world's leading independent provider of innovative marketplace and clearing technology. Our solutions are tailored to handle high transaction volumes with assured functionality and low latency. We are passionate about one thing: applying advanced financial technology to help marketplaces seize new opportunities in times of change.

Among our customers are leading exchanges such as the Chicago Board Options Exchange, the London Metal Exchange and NYSE Liffe. We also power new initiatives and alternative trading systems such as Alpha Trading Systems, Markit BOAT and Turquoise. Our clients rely on our platform-independent, Java-based technology to leverage change quickly and cost-effectively.

Cinnober is headquartered in Stockholm and employs 160 people who together have more than 1,500 man-years of experience in developing exchange and post-trade systems. We are an independent technology provider and do not operate a market of our own, avoiding any conflicts of interest. We are not owned by, nor have any ownership interests in, any market operator. Our track record says it all. We help our customers turn change into a competitive advantage.