Lead Data Engineer

Full time
London, United Kingdom
Posted 5 years ago

We are looking for a hands-on lead data engineer to take ownership of ETL at the Company and then build a team. This will involve evolving our distributed data parsing framework, scaling it towards submitting billions of tasks per day and handling multiple petabytes of data per year. The role includes design, build and maintenance; however, it is not a software development role but an infrastructure development role. The successful candidate will have a passion for managing complexity.

TEAM BUILDING

We are looking for someone to start in this role in Q1 2019. It is anticipated that they will spend 8-12 weeks getting up to speed on the systems in place. During this period, they will work in conjunction with the existing staff in this area. Once trust has been built between the hire and the Company, the plan is to gradually add up to three new junior staff (up to three years' experience) to this team over the next twelve months. As the team lead, you will be the key decision maker for these hires.

TEAM REMIT

Currently, ETL is managed by a variety of individuals from different teams. The Company's aim in forming a dedicated ETL team is to define clear boundaries of ownership between teams. The ETL team will work closely with the parser team (writing code to process the raw data), the devops team (wider devops support in the Company) and the metadata team. The ETL team will own the process from ingesting raw data to producing a curated object store, complete with quality control metrics, derived data and appropriate APIs and visualization to access those outputs.

TECHNICAL BACKGROUND

The fully automated Company framework runs highly complex multi-step data processing code around the clock, scaling on demand using AWS Spot Fleet. The platform is currently migrating to Apache Airflow (scheduler) and AWS Batch (executor) as our distributed cluster computing framework. The data storage underlying this framework includes AWS S3, EFS and PostgreSQL Aurora. Files are cached on an NFS layer (EFS) pending being moved to long-term S3 storage. File states are recorded in SQL via AWS CloudWatch Events and are then used to control movement through the graph. For the batch framework we use SQS as a message broker between scheduler and executor, and ElastiCache as an in-memory database. The ETL team will own this batch processing framework, which is currently used to submit tens of millions of tasks per day, and will move the system towards supporting submission of billions of tasks per day. You will increase the functionality, reliability and scalability of the system, as well as maintaining the existing system.
The ETL team will also own the billions of output files from the parsing process that sit on S3. This will involve curating the data, mapping it in a relational database, and checking for existence, consistency and corruption.
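As a concrete illustration of that setup, below is a minimal sketch of how a parsing step could be expressed as an Airflow DAG that submits work to AWS Batch, using the Airflow 1.10-era operator. The DAG id, job definition, queue, command and region are placeholders for illustration, not the Company's actual configuration.

```python
# Minimal sketch (Airflow 1.10-era API): an hourly DAG that submits a parsing
# task to AWS Batch. All names and values below are illustrative placeholders.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.awsbatch_operator import AWSBatchOperator

default_args = {
    "owner": "etl",
    "retries": 3,                           # retry strategy for spot interruptions / node deaths
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="raw_data_parsing",              # placeholder DAG id
    default_args=default_args,
    start_date=datetime(2019, 1, 1),
    schedule_interval="@hourly",            # periodic scheduling; ad hoc runs can be triggered manually
    catchup=False,
) as dag:

    parse_raw_files = AWSBatchOperator(
        task_id="parse_raw_files",
        job_name="parse-raw-files",
        job_definition="parser-container",  # Batch job definition wrapping the parser image
        job_queue="spot-parser-queue",      # queue backed by a spot-fleet compute environment
        overrides={"command": ["python", "parse.py"]},
        region_name="eu-west-2",
    )
```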

Examples of the sort of problems you will work on include the following:

  • Own the packaging and build of the data processing code into a container.
  • Build and visualize DAGs; perform reporting and predictions using DAGs.
  • Check resource availability before submission: the batch environment requires resources, such as database connections, and there is no point initiating a job if the resources it requires are not available.
  • Optimize the use of external resources when they are required.
  • Minimize cost: schedule jobs taking into account predictable movements in the spot price.
  • Schedule jobs conditional on when they need to be finished.
  • Use the ELK stack on AWS to monitor logging; understand which jobs have failed and why, and rerun jobs and/or liaise with the development team as required.
  • Design and build a Python feature extraction library for the data, run it within the DAG and persist the results into SQL (a sketch follows this list).
  • Evaluate technologies that could add value, for example AWS Glue, AWS SWF and AWS Data Pipeline, in conjunction and consultation with the CTO and our AWS solutions architect.
  • Design and maintain the quality control framework for ETL; produce user-facing dashboards to enable inspection of quality.
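To illustrate the feature extraction item above, here is a minimal sketch of the shape such a library step might take: a function deriving simple features from a parsed record, and a routine persisting them to PostgreSQL. The function names, table schema and connection string are assumptions for illustration only, not the Company's actual design.

```python
# Illustrative sketch of a feature extraction step persisting results to SQL.
# compute_features(), the features table and the DSN are hypothetical.
import json

import psycopg2


def compute_features(parsed_record: dict) -> dict:
    """Derive simple illustrative features from one parsed record."""
    values = parsed_record.get("values", [])
    n = len(values)
    return {
        "n_observations": n,
        "mean_value": sum(values) / n if n else None,
        "max_value": max(values) if n else None,
    }


def persist_features(record_id: str, features: dict, dsn: str) -> None:
    """Insert one row of features into a PostgreSQL table (assumed schema)."""
    conn = psycopg2.connect(dsn)
    try:
        # Using the connection as a context manager wraps the insert in a transaction.
        with conn, conn.cursor() as cur:
            cur.execute(
                "INSERT INTO features (record_id, payload) VALUES (%s, %s)",
                (record_id, json.dumps(features)),
            )
    finally:
        conn.close()


if __name__ == "__main__":
    record = {"values": [1.0, 2.5, 4.0]}
    feats = compute_features(record)
    persist_features("example-record", feats, dsn="dbname=etl user=etl")
```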

MACHINE LEARNING

The applicant needs to be clear that this is not a machine learning role per se. As a Company, however, we work in the field of machine learning, and our customers' ambition is to use machine learning to add value to their businesses. The data output by the ETL pipeline will be used by customers and applications to perform machine learning. The scope of machine learning within this role is that the Company wishes to systemize feature extraction at scale during the ETL phase and present those curated features to users through an API. Features will range from the statistically trivial to the output of sophisticated algorithms. To give a benchmark order of magnitude, one leading data provider offers 50,000 features per security.

YOUR PROFILE

The successful candidate will have owned a production grid computing system before. You will be highly familiar with topics such as periodic and aperiodic job scheduling, building time series of directed acyclic graphs (DAGs), dependencies, failure cases, logging, retry strategies, node deaths, job profiling and notifications.
You will have been working for at least eight years in industries such as VFX, telecommunications or internet; in fact, in any firm that requires engineering around big data. Your current employer may be a company such as BT Research, GCHQ, Vodafone, CapitalOne, ING, Airbnb or Facebook. The vast majority of your day-to-day work will have been software engineering around data problems – not just writing code for scientific use but engineering software systems that work at scale.
The Company is cloud native, fully based on AWS, with a codebase that is 95% Python 3.x. We do not require previous AWS or Python experience, but we do require the successful candidate to be willing to learn and work with both technologies on joining the firm. High-performance computing is a rapidly changing field and the role requires constant learning, specifically with AWS. The successful candidate must be willing to undergo AWS training and exams, with the support of the Company, soon after joining. Thereafter an important part of the role is keeping on top of the latest developments in the field.

REQUIREMENTS

  • Ownership of grid computing systems for data pipeline processing, using technologies such as Airflow, Condor, SGE, Docker, Mesos, Chronos, Celery or IBM Symphony
  • Strong coding and scripting skills (e.g. Python, C++, Java) for solving automation and data transformation tasks
  • Docker containers
  • Packer to build machine images
  • SQL relational databases, preferably PostgreSQL
  • Redis
  • HDF5, Parquet or other big data file formats
  • Ability to understand data problems and design scalable solutions that work in demanding production environments
  • An engineering mindset allowing you to understand complex problems and design real-life solutions
  • Strong academic background in a hard science such as physics or engineering

JOB LOCATION

London – Victoria

Apply online
