IBM Cloud

Jul 2015 to Dec 2021

Consultant: 6 years, 6 months

Roles: Architect, Principal Software Engineer, Team Lead (12 people)

Summary

The IBM Cloud projects I worked on were complex, large-scale, and extensive in scope.

It's difficult to express the problem, challenges, activities, and results in a few sentences.

Big, complex projects require a deeper dive, which is what I write about.

Overview

Employed by Innova Solutions as a full-time consultant with IBM Cloud.

IBM Cloud engages consultants to apply pressure to immediate challenges, opportunities, and problems.

My work was cloud-scale infrastructure and related technologies.

I delivered results in the following roles:

Architect for new features and products.
Principal Software Engineer to get products over the goal line, research and develop solutions for challenging problems, and remediate cloud-scale issues and challenges.
Team lead to deliver features and upgrades in complex and large cloud-scale projects.

Mar 2021 to Dec 2021

Consultant: 10 months

Roles: Architect, Principal Software Engineer, Team Lead (10 people)

Technologies: Ansible, Go, Python, QEMU virtualization

New Operating System project

IBM is adding a new operating system, which is used by every IBM Cloud server.

A new operating system is required to implement secure environments and support new hardware and software features.

The operating system performs the following:

Manages customer workloads (VMs - Virtual Machines), such as Linux or Windows.
Allows many VMs to run on a single physical server and provides isolation for each VM. This is why cloud services are inexpensive - a physical server is used by many users, lowering the cost for everyone.
Migrates VMs from one server to another with no downtime or disruptions.

New operating system challenges

Customer workloads (VMs) could not incur any downtime or disruptions as they migrated from the existing operating system to the new one.

The existing core infrastructure could not change. Everything had to run as-is across multiple operating systems. Examples are: deployments, upgrades, monitoring and observability, tools, and scores of other systems.

The chain of software that builds a server, from powered off to ready for customer use, is wide and deep. Adding a new operating system is like performing heart surgery with 12-foot surgical instruments - one slip and the patient dies.

New Operating System activities

Worked hand-in-hand with the operating system team to design and implement how compute, security, and storage work with the new operating system.

Led the team that integrated the new operating system, migrated from AppArmor to SELinux security, and implemented new features.

Ran an extensive testing phase to ensure that storage (imports, exports, backups, restores, and migrations) worked in all situations.

Worked with the infrastructure team to create a CI/CD (Continuous Integration / Continuous Delivery) pipeline to deploy the new operating system to 1,000s of hardware platforms.

New Operating System results

As an Architect, designed the system that integrates compute and storage such that SELinux would provide complete isolation for customer VMs. Implemented in Go and Python using QEMU for virtualization.

As a Principal Software Engineer, wrote the code that enables the migration of existing workloads from AppArmor security (old operating system) to SELinux security (new operating system). Implemented in Python using QEMU virtualization.

As a Principal Software Engineer, changed the CI/CD pipeline to deploy a new operating system. This may sound trivial, but the deployment pipeline, in a cloud, has a code base larger than most enterprise applications. Implemented using Ansible and Python.

As the Team Lead, implemented SELinux on the new operating system, such that it provided the most secure environment possible for customer workloads (VMs). This work enables the real-time migration of workloads from AppArmor to SELinux security.

Apr 2020 to Feb 2021

Consultant: 11 months

Roles: Principal Software Engineer

Technologies: Go, Python

Key Protect (encryption) project

Key Protect (KP) is an encryption solution that allows customer data to be secured and stored in the cloud using data encryption techniques.

Customer data is protected in transit and at rest by encrypting the data with a data key and wrapping that data key with a root key. Data and root keys are managed by the user or IBM.

The development team owns the end-to-end product. From development to managing production platforms to customer tools and documentation.
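The data key and root key wrapping described above can be sketched in a few lines of Python. This is a minimal illustration, not Key Protect's implementation: the function names are hypothetical, and a toy SHA-256 keystream with an HMAC tag stands in for a real authenticated cipher such as AES-GCM.

```python
import hashlib
import hmac
import secrets


def _keystream_xor(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data with a SHA-256 counter keystream.

    Illustrative stand-in for a real cipher; NOT secure for production use.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        pad = hashlib.sha256(key + nonce + counter).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


def seal(key: bytes, plaintext: bytes) -> bytes:
    """Encrypt and authenticate. Layout: nonce || ciphertext || tag."""
    nonce = secrets.token_bytes(16)
    ciphertext = _keystream_xor(key, nonce, plaintext)
    tag = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    return nonce + ciphertext + tag


def unseal(key: bytes, sealed: bytes) -> bytes:
    """Verify the tag, then decrypt. Raises ValueError on a wrong key."""
    nonce, ciphertext, tag = sealed[:16], sealed[16:-32], sealed[-32:]
    expected = hmac.new(key, nonce + ciphertext, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("wrong key or corrupted data")
    return _keystream_xor(key, nonce, ciphertext)


def envelope_encrypt(root_key: bytes, plaintext: bytes) -> tuple:
    """Encrypt data with a fresh data key, then wrap that key with the root key."""
    data_key = secrets.token_bytes(32)
    wrapped_key = seal(root_key, data_key)
    sealed_data = seal(data_key, plaintext)
    return wrapped_key, sealed_data


def envelope_decrypt(root_key: bytes, wrapped_key: bytes, sealed_data: bytes) -> bytes:
    """Unwrap the data key with the root key, then decrypt the data."""
    data_key = unseal(root_key, wrapped_key)
    return unseal(data_key, sealed_data)
```

Only the wrapped data key and the sealed data are stored next to each other. Without the root key, `envelope_decrypt` raises `ValueError`, which is why a lost or damaged root key makes the data unrecoverable.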

Key Protect (encryption) challenges

Key Protect is a complex product that requires deep development knowledge (security is paramount), extensive code reviews, operational expertise, and compliance with internal and external stakeholders.

One mistake, one operational hiccup, or one damaged or lost key can lead to crypto-shredding. This results in customer data that can never be recovered.

The Key Protect CLI (Command-Line Interface) software, which is used by customers to automate workflows, required a complete rewrite to incorporate new features and upgrades.

The Key Protect documentation lacked examples, explanations, and use cases. It was difficult to figure out how to create an end-to-end data protection (encryption) solution from the documentation.

Key Protect (encryption) activities

Developed Key Protect features, which secure the product or add functionality. Spent roughly 50% of my time in architecture and code reviews, looking for obscure errors and edge cases.

Automated platform operations to ensure thousands of components are performing as expected. For example, root keys are regularly rotated, certificates are kept up-to-date, and every Kubernetes cluster, node, pod, and container is correct (compliant, not outdated, operational, up-to-date, etc.).

Wrote the Key Protect CLI, in Go, from scratch.

Developed Key Protect API/CLI examples, explanations, and use cases. This is tedious and exacting work as customers often copy-and-paste examples, which they expect to work. One mistake in a workflow and customer data has the potential to be crypto-shredded.
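The kind of automated check mentioned above (root key rotation and certificate freshness) can be sketched as follows. This is a simplified illustration under stated assumptions: the `RootKey` and `Certificate` records, the `audit` function, and the 90-day and 30-day thresholds are all hypothetical and are not taken from the production automation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical policy values, chosen only for illustration.
MAX_KEY_AGE = timedelta(days=90)
CERT_WARNING_WINDOW = timedelta(days=30)


@dataclass
class RootKey:
    name: str
    last_rotated: datetime


@dataclass
class Certificate:
    name: str
    expires: datetime


def audit(keys, certs, now):
    """Return findings for overdue key rotations and expiring certificates."""
    findings = []
    for key in keys:
        if now - key.last_rotated > MAX_KEY_AGE:
            findings.append(f"root key {key.name}: rotation overdue")
    for cert in certs:
        if cert.expires <= now:
            findings.append(f"certificate {cert.name}: expired")
        elif cert.expires - now <= CERT_WARNING_WINDOW:
            findings.append(f"certificate {cert.name}: expires soon")
    return findings
```

Running a check like this on a schedule, and alerting on a non-empty result, is one way to make sure no rotation or renewal step is skipped.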

Key Protect (encryption) results

As a Principal Software Engineer, rewrote the Key Protect CLI from scratch in Go. Added JSON support, which was well received by customers, and saw usage grow steadily as the CLI was embedded in customer workflows.

As a Principal Software Engineer, wrote a Go program to validate IBM-specific markdown language and report warnings and errors. There are hundreds of validation rules and it also validates links across hundreds of IBM documentation projects. See the detailed writeup for the Markdown Validator project below.

Markdown Validator project

As a Principal Software Engineer, developed the Key Protect API and CLI documentation and examples in Curl, Go, Java, Node, Python, and Shell. Production documentation links are below.

As a Principal Software Engineer, automated platform operations, which reduced the workload for the entire team. Automation ensures that errors are not introduced and that steps are not skipped. Implemented in Python.

Apr 2018 to Mar 2020

Consultant: 2 years

Roles: Principal Software Engineer, Team Lead (12 people)

Technologies: Go, GraphQL, Kubernetes, Micro Services, Oracle, React

Cloud Modernization project

Multi-year project to migrate legacy (monolithic) functionality to micro services as IBM transitions to a VPC (Virtual Private Cloud) architecture.

Rewrite legacy PHP and Python code in Go. Implement a new architecture while ensuring that all corner and edge cases in the legacy code are adequately addressed.

Project required coordinating continuous upgrades across platforms (monolithic and micro services) and technologies (front end, middleware, back end, databases, etc.).

Cloud Modernization challenges

Product development never stops as features and products are added to both platforms (monolithic and micro services) simultaneously.

Coordinating complex activities across organizations, people, platforms, and technologies was a never-ending feat of synchronization.

Maintaining functional parity (API, CLI, web, internal interfaces) and a single design language over multiple years was yet another challenge.

Cloud Modernization activities

Migrated the front end from JavaScript-based frameworks to React and GraphQL. This activity touched every customer interface (user interface, workflows, documentation, etc.) across hundreds of cloud products.

Carved off large chunks of monolithic back end functionality and implemented them as micro services in Go, deployed via Kubernetes. Developed new API and CLI interfaces, which extend from the front end deep into the back end and infrastructure services.

Refactored customer ordering, billing, metering, and invoicing. Updates to core systems required massive database migrations (10s of billions of records) to accommodate new product offerings.

Cloud Modernization results

As a Principal Software Engineer, successfully upgraded hundreds of systems, which have 0 (zero) tolerance for downtime or service interruptions.

As a Principal Software Engineer, migrated IBM Cloud from IaaS (Infrastructure as a Service) to a VPC (Virtual Private Cloud) offering.

As a Team Lead (12 people), delivered hundreds of upgrades (front end, middleware, back end, database), deployed across 50+ global data centers.

As a Team Lead, upgraded internal databases and systems (customer ordering, billing, etc.) to accommodate new, competitive, product offerings.

May 2017 to Mar 2018

Consultant: 11 months

Roles: Architect, Team Lead (8 people)

Technologies: Go, Oracle, PHP

Spot Instances project

Spot Instances are VMs (Virtual Machines) available at steeply discounted rates compared to on-demand VMs - up to a 90% discount. A lower cost comes with the compromise that the cloud provider can reclaim the VM with a short warning.

The product had to satisfy a market demand for lower-cost resources without triggering a mass migration from higher-priced, on-demand resources to lower-cost Spot Instances.

New development work touched most cloud systems, such as user interface (UI), billing, core services, databases, documentation, support, application programming interfaces (API), command line interfaces (CLI), software development kits (SDK), etc.

Spot Instances challenges

Necessitated a rewrite of the VM allocation algorithm, which identifies idle resources (spare compute capacity) and allocates new Spot Instances across a fleet of servers.

Add Spot Instance support to every system that interfaces with VMs - front end, back end, middleware, databases, alerts and notifications, ordering, billing, metering, and invoicing.

The product road map required a tight customer delivery schedule because additional cloud features and products required work that was dependent on Spot Instances.
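To make the allocation problem above concrete, here is a small Python sketch of the general idea: place Spot VMs on spare capacity, and reclaim them when on-demand workloads need the room. It is not the production algorithm; the `Server` record, the function names, and the "most spare capacity first" and "largest Spot VM first" policies are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    total_cpus: int
    on_demand_cpus: int = 0
    spot_vms: dict = field(default_factory=dict)  # vm_id -> cpus

    @property
    def spare_cpus(self) -> int:
        """CPUs not used by on-demand or Spot VMs."""
        return self.total_cpus - self.on_demand_cpus - sum(self.spot_vms.values())


def place_spot(fleet, vm_id, cpus):
    """Place a Spot VM on the server with the most spare capacity.

    Returns the server name, or None when no server has room.
    """
    candidates = [s for s in fleet if s.spare_cpus >= cpus]
    if not candidates:
        return None
    target = max(candidates, key=lambda s: s.spare_cpus)
    target.spot_vms[vm_id] = cpus
    return target.name


def place_on_demand(server, cpus):
    """Add on-demand load, reclaiming Spot VMs only if needed.

    Returns the ids of reclaimed Spot VMs (largest reclaimed first).
    """
    if server.total_cpus - server.on_demand_cpus < cpus:
        raise RuntimeError("not enough capacity even without Spot VMs")
    reclaimed = []
    by_size = sorted(server.spot_vms.items(), key=lambda kv: kv[1], reverse=True)
    for vm_id, _ in by_size:
        if server.spare_cpus >= cpus:
            break
        del server.spot_vms[vm_id]
        reclaimed.append(vm_id)
    server.on_demand_cpus += cpus
    return reclaimed
```

In a real system, each reclaimed VM would first receive the short warning mentioned earlier, and existing on-demand VMs would never be touched.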

Spot Instances activities

Worked extensively with product and development teams on architecture, features, planning, and requirements.

Created the blueprint to implement Spot Instances. The plan touched dozens of systems and required an extensive amount of development and testing. A mandatory requirement was to not disrupt existing production VMs and workloads at any cost.

Built and led a team of developers with diverse skills and years of cloud development experience.

Spot Instances results

As an Architect, took Spot Instances from an idea to implementation (customer use) in 11 months. Planning and documenting took three months. Development and testing lasted six months. System and regression testing took two months.

As an Architect, I worked with the product team on architecture and customer fit. Once the product was defined, I created the technology blueprint that each team used for development and testing.

As a Team Lead (8 people), I shepherded the project from initial concept to product delivery and customer use.

Spot Instance functionality was delivered on time.

Jul 2015 to Apr 2017

Consultant: 1 year, 10 months

Roles: Architect, Principal Software Engineer, Team Lead (6 people)

Technologies: Ansible, Chef, Python

Object Storage project

IBM was transitioning its object storage (like Amazon S3) from an OpenStack Swift implementation to Cleversafe, an object storage company IBM bought in 2015.

The road map called for deploying Cloud Object Storage (COS) at 50+ global cloud data centers.

The team was required to maintain the existing OpenStack Swift platform at the same time it built, automated, deployed, and migrated new hardware and software platforms.

Object Storage challenges

Keep the existing OpenStack Swift platform stable. Seamlessly migrate customers to the new Object Storage platform.

Accelerate the deployment of exabytes of Cloud Object Storage hardware and software across 50+ global data centers in an automated and repeatable way.

Automate the deployment of hardware (compute, networking, storage), software, middleware, and systems in all data centers, while ensuring corporate and FedRAMP compliance.

Object Storage activities

Create systems and software to automate the deployment of large-scale Object Storage clusters. The first deployment took 7 months, the second deployment was 7 weeks, and the third deployment was 7 days (about the theoretical limit to deploy massive petabyte storage systems).

Deploying exabytes of Object Storage was accomplished by treating the entire process as infrastructure as code (IaC).

Architect, design, and implement systems to discover and analyze large-scale, distributed Object Storage anomalies across 50+ global data centers (exabyte scale). The goal is to ensure consistency and repeatability of cloud deployments and continuous cloud operations.

Triage and remediate cloud-scale Object Storage issues, such as alerts, anomaly detection, load balancing, and monitoring.
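As an illustration of the anomaly discovery described above, the sketch below flags data centers whose metric deviates sharply from the rest of the fleet. It uses a modified z-score based on the median absolute deviation (MAD), a common robust outlier test. This is an assumed technique for illustration only, not a description of the code I wrote; the function name, the 3.5 threshold, and the data center names are hypothetical.

```python
from statistics import median


def find_anomalies(readings, threshold=3.5):
    """Return the names of outliers by MAD-based modified z-score.

    readings maps a data center name to one numeric metric,
    for example error count or request latency.
    """
    if not readings:
        return []
    values = list(readings.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread around the median; this simple rule cannot rank outliers.
        return []
    return sorted(
        name
        for name, value in readings.items()
        if abs(0.6745 * (value - med) / mad) > threshold
    )


if __name__ == "__main__":
    latency_ms = {
        "dc-01": 101, "dc-02": 99, "dc-03": 100,
        "dc-04": 102, "dc-05": 98, "dc-06": 250,
    }
    print(find_anomalies(latency_ms))
```

The median and MAD are used instead of the mean and standard deviation because a single badly behaving data center would otherwise distort the baseline it is compared against.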

Object Storage results

As an Architect, I automated the end-to-end process to address frailties of managing large-scale deployments in a consistent and repeatable manner.

As a Principal Software Engineer, I wrote Python code that detects anomalies across 50+ global data centers (exabyte scale).

As a Principal Software Engineer, I developed the process for automating the deployment of large-scale Object Storage clusters across 50+ global data centers. Time to deploy went from 7 months to 7 weeks to 7 days.

As a Team Lead, I worked with product owners, project managers, and developers to create an end-to-end Object Storage deployment solution that scaled across 50+ global data centers.

Thank you

This is a lot to take in. Thank you for reading this far.
