By Andrew Robinson, Well-Architected Geo Systems Architect at AWS
By Phil Horn, Director of Business Development at Steamhaus
By Bobby Gilbert, Director of Global Innovation at Sperry Rail
Over the last three years, Sperry Rail has developed an artificial intelligence (AI) system called Elmer, named after Sperry’s founder, Dr. Elmer Sperry. Elmer uses machine intelligence to inspect thousands of miles of ultrasound scans collected by Sperry’s inspection vehicles, searching for evidence of cracks in the rail.
Elmer has already reduced by 66 percent the number of decisions a human analyst must make, lowering the time it takes to identify and rectify issues.
Elmer was built as a proof-of-concept by a team of just four engineers leveraging Amazon Web Services (AWS). To take advantage of the latest features, reduce costs, and deliver the scalability needed, Sperry engaged with Steamhaus, a DevOps and cloud consultancy based in Manchester, England.
Steamhaus is an AWS Partner Network (APN) Advanced Consulting Partner and member of the AWS Well-Architected Partner Program. Steamhaus is experienced in using the AWS Well-Architected Framework to help cloud architects build secure, high-performing, resilient, and efficient infrastructures for their applications.
In this post, we’ll describe the partnership between Steamhaus and Sperry Rail, and the way it has evolved Elmer from a proof-of-concept to a full-production, globally available system.
How Steamhaus Partnered with Sperry Rail
Over two days, Steamhaus conducted a Well-Architected Review on-site with the team who designed, built, and currently manage Elmer at Sperry Rail. They bundled the Well-Architected Review with a professional services consultancy.
This partnership allowed quick improvements in efficiency, highlighting the biggest risks that needed to be immediately addressed while ensuring the requirements of running the business day-to-day did not get in the way of improving Elmer.
Taking the approach of a peer review, the Steamhaus consultants combined AWS Well-Architected best practices with their knowledge of, and experience with, building and operating workloads on AWS. They proposed short- and long-term improvements to help Sperry across the Five Pillars of AWS Well-Architected: operational excellence, security, reliability, performance efficiency, and cost optimization.
We found Elmer to be fascinating, as well as rather complex, and have divided our post into sections:
- What Elmer does
- Objectives for refining Elmer
- Resulting architecture
- Why Sperry chose containers
What Elmer Does
Rails in service develop anomalies that can be hard or impossible to see without ultrasound detection. The anomalies vary from non-flaws, such as bolt holes and rail ends, to flaws in the rail like cracks and vertical defects. If not detected, they can grow until the rail breaks, leading to service interruptions or train derailments.
Sperry Rail mounts ultrasound and other detection systems on vehicles that scan hundreds of miles of rail each day. Sperry feeds the collected data into neural networks that analyze it and identify cracks or other anomalies in the rail. The results are presented to a human analyst, who determines the corrective action.
These rail scans generate vast quantities of data, which Sperry uses to train its neural networks. Training consists of running thousands of scans of both faulty and healthy rail sections through machine learning (ML) models until the models learn to distinguish faulty sections from healthy ones.
As more data becomes available, Elmer uses it to refine the ML models.
Figure 1 – What Elmer does.
Objectives for Refining Elmer
One of the main goals was to free Sperry engineers from monitoring, fixing, and getting distracted by the day-to-day activities of operating a large-scale machine learning workload so they could spend more time on what matters to the business.
To that end, Steamhaus identified two major architectural goals:
- Adopt more serverless services and abstract more resources.
- Implement a managed build, continuous integration, and load test environment by taking advantage of the automation features in AWS services.
The Resulting Architecture
We designed the following architecture in close collaboration with Sperry:
Figure 2 – The resulting architecture.
TensorFlow Machine Learning Framework
Sperry chose TensorFlow as the ML framework because of its exceptional open source support community, its flexibility to run on a variety of platforms, and the speed with which the team can build, train, and deploy new models.
Elmer’s neural network can consume up to one hundred Amazon Elastic Compute Cloud (Amazon EC2) instances using the TensorFlow framework, so Steamhaus suggested Sperry host Elmer on AWS Fargate to avoid the administrative overhead of managing those instances.
Sperry’s Data Lake
Because Sperry was using simple data types, Amazon DynamoDB was a good choice: it scales more easily than a traditional relational database management system (RDBMS), and it performs consistently regardless of scale.
They selected Amazon Athena for its simplicity of use and low cost, and because it let Sperry easily integrate new data sources stored in Amazon Simple Storage Service (Amazon S3). Sperry chose Amazon Aurora for its ability to operate at significant scale and easily scale database resources, while being able to use the same tools as MySQL.
All stored artifacts, whether generated directly from the Sperry Data Management System (SDMS) database or produced during data processing and ML, are kept in Amazon S3. Sperry took this approach to reduce the load on the compute resources, reserving them to run the application rather than to store and retrieve static content.
How Elmer Processes Rail Data
The real-time acquisition systems are located on the rail vehicles. They generate scan data from up to 16 ultrasonic transducers per rail at speeds of up to 80 km/h. This data is stored in a proprietary “T1K” file format and uploaded to the SDMS servers. SDMS runs on a relational SQL database on Microsoft Windows servers. From there, the T1K data is diverted to Amazon S3 in the cloud, where it triggers AWS Lambda functions.
AWS Lambda extracts the headers from the proprietary T1K raw data files coming from SDMS and writes them to a DynamoDB table. The T1K files contain the raw data from the 16 ultrasonic transducers, plus GPS location, milepost location, and operational metadata.
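As a rough illustration of the flow described above, the sketch below shows how a Lambda handler might turn an S3 upload notification into a DynamoDB item. The T1K format is proprietary and not described in this post, so the header reader is passed in as a hypothetical callable. The attribute names (`file_key`, `source_bucket`) and the injected `put_item` function are likewise assumptions made for illustration, not Sperry's actual code.

```python
from typing import Any, Callable, Dict, List, Tuple
from urllib.parse import unquote_plus

# Hypothetical callables: the real T1K header layout is proprietary, and a real
# deployment would wrap boto3 calls (for example s3.get_object and
# table.put_item) behind these two functions.
HeaderReader = Callable[[str, str], Dict[str, Any]]
ItemWriter = Callable[[Dict[str, Any]], None]


def uploaded_objects(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handle(event: Dict[str, Any], read_header: HeaderReader,
           put_item: ItemWriter) -> int:
    """Write one item per uploaded T1K file and return the number written."""
    written = 0
    for bucket, key in uploaded_objects(event):
        header = read_header(bucket, key)
        # Assumed attribute names; the real table schema is not public.
        item = {"file_key": key, "source_bucket": bucket, **header}
        put_item(item)
        written += 1
    return written


if __name__ == "__main__":
    sample_event = {
        "Records": [
            {"s3": {"bucket": {"name": "example-scans"},
                    "object": {"key": "runs/run+001.t1k"}}}
        ]
    }
    stored: List[Dict[str, Any]] = []
    # Stand-ins for the header parser and the DynamoDB writer.
    count = handle(sample_event, lambda b, k: {"milepost": "12.4"}, stored.append)
    print(count, stored)
```

Passing the reader and writer in as parameters keeps the event-handling logic testable without AWS credentials; in a deployed function they would be bound to real clients once, outside the handler.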
The transducer data series in the T1K files are stored as encrypted Parquet files in Amazon S3, where they can be queried by Athena and streamed for consumption by the TensorFlow ML model.
Sperry Rail selected Parquet because it supports encrypting the data for transmission and can be natively queried by Athena from S3. There is a size penalty, as the Parquet files are around twice the size of the native T1K files, but that penalty is offset by the ability to query the files natively.
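To make the native-query point concrete, here is a minimal sketch of the request one might send to Athena's StartQueryExecution API (for example through boto3's `start_query_execution`). The database, table, and column names, the zero-based channel numbering, and the results location are all hypothetical placeholders; this post does not describe Sperry's actual schema.

```python
from typing import Any, Dict


def build_athena_request(channel: int, start_mile: float, end_mile: float,
                         database: str, output_location: str) -> Dict[str, Any]:
    """Build keyword arguments for Athena's StartQueryExecution call.

    The table name (transducer_series) and the column names are assumptions
    made for illustration only.
    """
    # Assumes channels are numbered 0-15, one per transducer on a rail.
    if not 0 <= channel < 16:
        raise ValueError("channel must be between 0 and 15")
    if start_mile > end_mile:
        raise ValueError("start_mile must not exceed end_mile")

    # Only validated numbers are interpolated, so no string quoting is needed.
    query = (
        "SELECT milepost, amplitude "
        "FROM transducer_series "
        f"WHERE channel = {int(channel)} "
        f"AND milepost BETWEEN {float(start_mile)} AND {float(end_mile)} "
        "ORDER BY milepost"
    )
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


if __name__ == "__main__":
    request = build_athena_request(3, 10, 12.5, "rail_db",
                                   "s3://example-bucket/athena-results/")
    print(request["QueryString"])
```

The returned dictionary could be unpacked into a boto3 Athena client call; building it separately keeps the query text easy to review and test.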
Amazon Simple Queue Service (Amazon SQS) and Amazon Simple Notification Service (Amazon SNS) manage the data flows into the neural network. Amazon SQS decouples these components and provides scalability between parts of the application. AWS Step Functions helps manage and coordinate the workflow between Lambda and Amazon SQS.
Other metadata used to train and validate the machine learning models arrives in XML files and is parsed and loaded into DynamoDB. Results from the ML analysis are also stored in DynamoDB, from where they are packaged into XML files and delivered to the data scientist.
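The XML round trip described above can be sketched with Python's standard library. The element and attribute names here (`labels`, `label`, `results`, `result`, `file`, `milepost`, `kind`) are invented for illustration, since the post does not show Sperry's XML schema, and the items are plain dictionaries standing in for DynamoDB records.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List

# Hypothetical training metadata; the real schema is not published.
SAMPLE_XML = """<labels>
  <label file="run-001.t1k" milepost="12.40" kind="bolt_hole"/>
  <label file="run-001.t1k" milepost="12.75" kind="crack"/>
</labels>"""


def labels_to_items(xml_text: str) -> List[Dict[str, str]]:
    """Parse label elements into flat dictionaries ready to be stored."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.iter("label"):
        items.append({
            "file_key": node.attrib["file"],
            # Kept as a string: DynamoDB clients typically reject Python floats.
            "milepost": node.attrib["milepost"],
            "kind": node.attrib["kind"],
        })
    return items


def items_to_xml(items: Iterable[Dict[str, str]]) -> str:
    """Package stored result items back into an XML document."""
    root = ET.Element("results")
    for item in items:
        ET.SubElement(root, "result", attrib=dict(item))
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    parsed = labels_to_items(SAMPLE_XML)
    print(parsed)
    print(items_to_xml(parsed))
```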
Why Sperry Chose Containers
AWS Fargate is a serverless compute engine for containers that removes the need to provision and manage servers. Fargate allocates the right amount of compute, eliminating the need to choose instances and scale cluster capacity. Sperry only pays for the resources required to run its containers.
Sperry jobs are now packaged into self-contained Docker containers running in Fargate rather than on Amazon EC2 instances, which were mapped 1:1 (i.e. one container on one instance).
The earlier one-container-per-instance approach presented scaling issues, since the only option for dealing with more jobs was to add more instances, even though most were only 30-70 percent utilized. Fargate gave Sperry the ability to improve the utilization of those resources, and quickly optimize for cost or throughput.
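A back-of-the-envelope calculation shows why the 1:1 mapping was wasteful. The job utilization figures below are made-up values within the 30-70 percent range mentioned above, and the pooled figure is an idealized lower bound that ignores packing constraints and Fargate's fixed task sizes.

```python
from typing import Sequence


def instances_one_to_one(utilization_percent: Sequence[int]) -> int:
    """With one container per instance, every job needs its own instance."""
    return len(utilization_percent)


def instance_equivalents_used(utilization_percent: Sequence[int]) -> int:
    """Whole instance-equivalents of compute the jobs actually consume.

    Integer arithmetic avoids floating-point rounding surprises.
    """
    total = sum(utilization_percent)
    return -(-total // 100)  # ceiling division


if __name__ == "__main__":
    # Hypothetical jobs, each using 30-70 percent of one instance.
    jobs = [30, 50, 70, 40, 60, 50]
    print("instances provisioned (1:1):", instances_one_to_one(jobs))
    print("instance-equivalents used:  ", instance_equivalents_used(jobs))
```

The gap between the two numbers is capacity that was paid for but left idle under the 1:1 model; paying only for the resources each container requests narrows that gap.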
By adopting a containerized, serverless approach to the architecture, Steamhaus reduced the time Sperry engineers spent on systems management and deployment by 15 percent, freeing them to focus on adding new features.
The consulting services from Steamhaus, based on the AWS Well-Architected Framework, have supported the Elmer team at Sperry Rail so they can continue to focus on developing new capabilities to meet customer demands and changing requirements within their industry.
The infrastructure automation, image building, continuous integration, and automated load testing provided by AWS services let Sperry deploy Elmer into many AWS Regions across the world within hours.
Steamhaus – APN Partner Spotlight
Steamhaus is an AWS Well-Architected Partner that helps cloud architects build secure, high-performing, resilient, and efficient infrastructures for their applications.