Tokyo Amazon Full time
Join the Chaos Engineering team in Amazon Search.

We perform experiments in production to harden Search against outages and make sure that whenever a customer searches for products, they find what they are looking for.



In this role you will:
  • Design, implement, execute, and automate chaos experiments to continuously test Amazon Search's resilience against hardware failures, dependency outages, traffic spikes, and more.
  • Collaborate with service owners to remedy vulnerabilities, minimize blast radius and harden Amazon Search.
  • Research tools and practices in resilience engineering and adopt them as appropriate.

Joining this team, you'll experience the benefits of working in an entrepreneurial environment while leveraging the resources of Amazon (AMZN), one of the world's leading internet companies.

We are a diverse, customer-obsessed and passionate team located in Meguro, Tokyo.

Key job responsibilities

  • Develop and maintain our chaos experiment orchestrator
  • Design, execute, automate, and maintain chaos experiments
  • Develop and maintain our distributed load generator
  • Develop and maintain our petabyte-scale log archival and query service
  • Join a 12/12 on-call rotation for incident response and mitigation

Basic Qualifications:

  • Experience programming with at least one modern language such as Python, Ruby, Golang, Java, C++, C#, Rust

Preferred Qualifications:

  • Experience with Linux/Unix
  • Experience in networking, storage systems, operating systems, and hands-on systems engineering
  • Experience with distributed operational health and performance monitoring systems

Our inclusive culture empowers Amazonians to deliver the best results for our customers.

If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information.

If the country/region you're applying in isn't listed, please contact your Recruiting Partner.
