Adrian is a principal engineer at Hotels.com in London, where he works with the teams that build the services powering the company's big data processing systems. Prior to this, Adrian led the big data team at Last.fm, and he has been using Hadoop and various other parts of the big data ecosystem since 2007. He has previously spoken at Strata and co-wrote a chapter in the early editions of the seminal “Hadoop: The Definitive Guide”.
This presentation describes the journey taken by the Hotels.com big data platform team when tasked with migrating large data sets and pipelines from on-premises clusters to cloud-based platforms. We present two open source tools that we built to overcome the unexpected challenges we faced.
The first of these is Circus Train, a dataset replication tool that copies Hive tables between clusters and clouds. We will also discuss other options for dataset replication and the features that set Circus Train apart. The second tool is Waggle Dance, a federated Hive query service that enables querying of data stored across multiple Hive metastores. We will explain how Waggle Dance differs from existing federated SQL query engines and which use cases it enables. Using real-world examples, we will describe how we've used these tools to successfully build a petabyte-scale platform that is now also being used by other brands within the Expedia organisation. We focus on actual problems and solutions that have arisen in a huge, organically grown corporation, rather than on idealised architectures.