We perform experiments in production to harden Search against outages and make sure that whenever a customer searches for products, they find what they are looking for.
In this role you will:
- Design, implement, execute, and automate chaos experiments to continuously test Amazon Search's resilience against hardware failures, dependency outages, traffic spikes, and more.
- Collaborate with service owners to remedy vulnerabilities, minimize blast radius and harden Amazon Search.
- Research tools and practices in resilience engineering and adopt them as appropriate.
Joining this team, you'll experience the benefits of working in an entrepreneurial environment while leveraging the resources of Amazon (AMZN), one of the world's leading internet companies.
We are a diverse, customer-obsessed, and passionate team located in Meguro, Tokyo.
Key job responsibilities
- Develop and maintain our chaos experiment orchestrator
- Design, execute, automate, and maintain chaos experiments
- Develop and maintain our distributed load generator
- Develop and maintain our petabyte-scale log archival and query service
- Join a 12/12 on-call rotation for incident response and mitigation
Basic Qualifications:
- Experience programming with at least one modern language such as Python, Ruby, Golang, Java, C++, C#, Rust
Preferred Qualifications:
- Experience with Linux/Unix
- Experience in networking, storage systems, operating systems, and hands-on systems engineering
- Experience with distributed operational health and performance monitoring systems
If you have a disability and need a workplace accommodation or adjustment during the application and hiring process, including support for the interview or onboarding process, please visit for more information.
If the country/region you're applying in isn't listed, please contact your Recruiting Partner.