Principal Site Reliability Engineer

5 days ago

Tokyo UiPath

Life at UiPath

The people at UiPath believe in the transformative power of automation to change how the world works. We're committed to creating category-leading enterprise software that unleashes that power.

To make that happen, we need people who are curious, self-propelled, generous, and genuine. People who love being part of a fast-moving, fast-thinking growth company. And people who care—about each other, about UiPath, and about our larger purpose.

Could that be you?

Would you like to work with us at the cutting edge of "Agentic"?

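Through end-to-end business automation, UiPath has long supported the efficiency gains and transformation of Japanese companies. Our focus now is "agentic automation": coordinating AI agents, RPA robots, and people to automate work across the entire enterprise safely and reliably.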

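UiPath K.K. (UiPath株式会社) has been elevated to a region reporting directly to headquarters and, under a strategy that positions Japan as a top-priority hub, aims to deliver solutions from Japan to the world. UiPath is looking for people who are curious, self-driven, and quick on their feet. We need colleagues who enjoy the speed and change of business, care for one another, and keep growing together. Join us at UiPath to make agentic automation a reality and transform society together.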

Role Overview

This is a high-impact, principal-level role designed for an engineer who excels in the "heat of the moment." Operating with a high degree of autonomy, you will take operational leadership to restore the stability of UiPath's large-scale distributed services, blending deep technical SRE expertise with the authoritative presence of an Incident Commander.

You will partner closely with platform, infrastructure, and application teams globally to improve service availability, reduce operational toil, and ensure our systems scale reliably under real-world load and failure conditions.

You will act as the Japan regional owner for SRE standards and maintain a close partnership and functional alignment with UiPath's Global SRE organization. You will also own service reliability, observability, automation, and continuous improvement initiatives for the region.

You will report primarily to the Senior Director of Japan and functionally to the Vice President of SRE, who is based in the U.S. You will also act in a managerial capacity, with another team member reporting to you.

What You'll Be Working On

1. Incident Command & Tactical Response

• Lead Incident Command: Act as the primary Incident Commander for high-stakes technical events. Establish command and control, orchestrate cross-functional response efforts (Compute, Network, Storage, Database), and maintain a common operating picture for all stakeholders.

• Live Site Troubleshooting: Serve as a key escalation point for complex issues. Use your deep understanding of service topology and dependencies to diagnose "grey failures" and resolve disruptions promptly.

• Executive Communication: Own the communication life cycle. Deliver real-time, executive-level briefings during active incidents, translating technical jargon into clear business impact and recovery timelines for leadership.

2. Prevention & Reliability Engineering

• Post-Incident Evolution: Lead thorough retrospectives and RCAs. Beyond just documenting what happened, you will drive and influence the discovery and implementation of automated self-healing solutions to ensure the same issue never occurs twice.

• Observability: Define, track, and improve service health by promoting well-designed SLIs and SLOs. Influence and implement proactive monitoring, dashboards, and early-warning alerts to identify performance bottlenecks before they trigger an incident.

• Toil Automation: Design and implement automation to reduce manual intervention during incidents and routine operations. Apply engineering rigor to operational workflows to eliminate repetitive and error-prone tasks.

• Service Resilience: Know how to test service behavior under load, including degradation modes, scaling characteristics, and dependency failures. Ensure backup, restore, and disaster recovery capabilities are implemented, tested, and maintained.

3. Service Design & Cross-functional Leadership

• Architectural Partnership: Partner with development teams to champion high availability and readiness of the services and promote best practices on reliability, resilience, and operability.

• Team Mentorship: Advocate for SRE best practices. Mentor and support other engineers, helping raise the overall incident response and reliability maturity of the organization.

What You'll Bring to the Team

• Experience: 7+ years in SRE, Cloud Operations, or a related technical field, with at least 3 years in a lead responder or command-oriented role.

• Command Presence: Demonstrated ability to remain calm, focused, and decisive under extreme pressure. You can lead a room of diverse stakeholders and drive technical conversations to successful outcomes.

• Forensics & Investigation: Skill in analyzing system artifacts, network data, and performance dashboards to guide a multidisciplinary audience to the likely root-cause areas of service failures.

• Technical Breadth: Strong proficiency in Python or Go and a holistic understanding of distributed systems, Kubernetes, and cloud infrastructure (preferably Azure).

• Observability Expertise: Deep experience with Prometheus/Grafana, OpenTelemetry, or an equivalent third-party observability stack.

• Availability: Willingness to participate in the on-call rotation as an Incident Commander for high-severity issues.

Nice to have

• Command Frameworks: Familiarity with structured command systems (such as the Incident Command System - ICS) used in crisis management.

• LLM Ops: Experience using LLMs or AI-driven detection systems to solve reliability and capacity challenges in GPU-heavy, high-performance computing environments.

• AI Tooling: Champion the use of AI tools and LLM-powered agents to improve SRE pillars including, but not limited to, reducing operational toil.

• Event-Driven Remediation: Proven history of building "self-healing" infrastructure via Terraform, Azure Service Operator, or any other equivalent solutions.

Working Hours & Language Skills

• Working Hours: The role follows a standard work schedule starting at 8:00 a.m. Flexibility may be required to support on-call rotations and respond to incidents, particularly those affecting customers in Japan.

• Language Skills: Strong proficiency in English for effective communication with global functional team members, combined with Japanese proficiency to clearly convey incident details, root causes, and remediation plans to customers and local stakeholders in the Japanese market.

Maybe you don't tick all the boxes above—but still think you'd be great for the job? Go ahead, apply anyway. Please. Because we know that experience comes in all shapes and sizes—and passion can't be learned.

Many of our roles allow for flexibility in when and where work gets done. Depending on the needs of the business and the role, the number of hybrid, office-based, and remote workers will vary from team to team. Applications are assessed on a rolling basis and there is no fixed deadline for this requisition. The application window may change depending on the volume of applications received or may close immediately if a qualified candidate is selected.

We value a range of diverse backgrounds, experiences, and ideas. We pride ourselves on our diverse and inclusive workplace, which provides equal opportunities to all persons regardless of age, race, color, religion, sex, sexual orientation, gender identity and expression, national origin, disability, neurodiversity, military and/or veteran status, or any other protected classes. Additionally, UiPath provides reasonable accommodations for candidates on request and respects applicants' privacy rights. To review these and other legal disclosures, visit our .

Site Reliability Engineer

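4 weeks ago

Greater Tokyo Area BLOOMTECH, Inc

A stable customer base that includes more than 70% of the top 100 companies by market capitalization; flexible working through hybrid work and flextime; the excitement of growing the infrastructure for a new product from scratch. · Group management at large companies competing in the global market is becoming ever more difficult due to M&A and overseas expansion. · Beyond simple maintenance and operations, you will be involved in a wide range of phases, from service design and development to long-term refinement. · ...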
Site Reliability Engineer

4 weeks ago

Tokyo PowerX, Inc.

The SRE/DevOps team, which develops and operates systems to deliver PowerX's services at high quality and to drive the business faster and smarter, is looking for talented software engineers. · ...
Site Reliability Engineer

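4 weeks ago

Tokyo 株式会社パワーエックス

The SRE/DevOps team builds the critical foundations of PowerX's services to a high standard of quality, and develops and operates systems that drive the business faster and smarter · The challenge of achieving high reliability in new services that use storage batteries · An environment where you can work with talented SWEs · The freedom to make your own design and technology choices and carry them forward · ...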
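PlayStation Network Site Reliability Engineer

1 day ago

Japan, Tokyo Sony Interactive Entertainment

We are recruiting members for the engineering team that designs, builds, and operates services for PlayStation Network. You will be responsible for ensuring the reliability, performance, efficiency, and security of the service. · ...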
Network Site Reliability Engineer

2 weeks ago

Tokyo PlayStation

We are recruiting members for the engineering team that designs, builds, and operates "PlayStation Network," the network service provided for PlayStation. · ...
Software Engineer, Site Reliability

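1 day ago

Tokyo Tailor

Making the hard parts of building products easy, so that anyone can become a product creator: this is the world Tailor wants to realize. We aim for a world where anyone can easily bring their ideas to life, where the boundary between business and engineering is removed, and where diverse expertise and technologies can be brought together. · Building CI/CD environments for the front end and for back-end applications built on TailorPF · ...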
Site Reliability Engineer

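1 day ago

Tokyo 株式会社トリファ ￥6,000,000 - ￥12,000,000

You will design, build, and operate infrastructure centered on AWS/GCP. You will also be responsible for improving the design of CI/CD pipelines and for building out and operating the monitoring and logging platform. · ...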
1103_Site Reliability Engineer (SRE)

2 months ago

Tokyo TIER IV

Job summary · ... Autoware-equipped self-driving vehicles around the world to ensure safety and reliability. ...
SRE(Site Reliability Engineer)

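1 month ago

Nishi-Gotanda, Shinagawa-ku, Tokyo, 株式会社ロジレス

Under our mission to "transform e-commerce logistics and scale Japan's future," we are taking on the e-commerce market, which is worth about 15 trillion yen and growing at 3.7%. We aim to solve serious social problems such as labor shortages and rising logistics costs, and to raise the productivity of both e-commerce businesses and warehouse operators. · Build and operate infrastructure on AWS. · Check how systems are behaving through monitoring, log analysis, and similar means. · Also handle performance optimization and the removal of bottlenecks. · ...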
SRE(Site Reliability Engineer)

1 month ago

Ginza-itchome Station, Chuo-ku, Tokyo, 株式会社テックドクター Remote job

...
Network Site Reliability Engineer

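2 months ago

Tokyo PlayStation ￥3,600,000 - ￥12,000,000 per year

This is the engineering division responsible for the planning, design, development, and operation of PlayStation Network. We provide consumers and game developers with a broad range of offerings that make up the PlayStation lifecycle, from client software to platform services such as game content distribution and sales, online gaming, and social community features. · As a Site Reliability Engineer and a member of the server-side application development team, you will be responsible for ensuring the reliability, performance, efficiency, and security of the service. · ...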
SRE (Site Reliability Engineer)

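3 weeks ago

Nihonbashi-Honcho, Chuo-ku, Tokyo, Thinkings株式会社 Remote job ￥4,200,000 per year

Job summary · Automating and streamlining infrastructure provisioning and operations, and building and improving monitoring and observability platforms to prevent failures and minimize their impact · Designing, building, and operating the infrastructure and CI/CD platforms that underlie multiple products, including Sonar ATS · Improving the performance and scalability of each product · 3+ years of experience as an SRE or infrastructure engineer · ...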
Speeda - SRE (Site Reliability Engineer)

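3 weeks ago

Marunouchi, Chiyoda-ku, Tokyo, 株式会社ユーザベース

We are looking for engineers to build and operate the hybrid cloud that supports our in-house product "Speeda" and to improve its performance, reliability, and scalability. · Building a hybrid cloud using on-premises infrastructure, GCP, and AWS · Developing and operating microservices together with development teams · Reducing toil · Operating Docker, Kubernetes, and Istio · ...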
Senior Site Reliability Engineer /215918

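3 weeks ago

Higashi-Shimbashi, Minato-ku, Tokyo, 株式会社UPSTART Remote job ￥10,000,000 - ￥18,000,000 per year

Working with product managers who have deep knowledge of cloud infrastructure and data analytics platforms, and with the leaders of the dotData product development team, you will clarify the requirements and specifications for availability, reliability, security, and other qualities that the products and services demand; evolve the system architecture incrementally; make full use of the latest technology to automate and streamline operations; drive continuous operational improvement; and continuously release services that many customers use at stable quality. In the medium to long term, the role may also involve leading the team organizationally as an engineering manager, or technically as a staff engineer ...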
1103_Site Reliability Engineer (SRE)

4 weeks ago

Tokyo TIER IV ￥5,800,000 - ￥16,500,000

...
Software Engineer, Site Reliability (Contract)

1 day ago

Tokyo Tailor

Making the hard parts of building products easy, so that anyone can become a product creator: this is the world Tailor wants to realize. · ...
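SRE (Site Reliability Engineer) (Contract)

1 day ago

Tokyo Tailor

Making the hard parts of building products easy, so that anyone can become a product creator: this is the world Tailor wants to realize. We aim for a world where anyone can easily bring their ideas to life, where the boundary between business and engineering is removed, and where diverse expertise and technologies can be brought together. · ...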
SRE(Site Reliability Engineer)

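1 month ago

Shin-Yokohama, Kohoku-ku, Yokohama, Kanagawa, NE株式会社 ￥6,000,000 - ￥8,000,000 per year

About NE: NE株式会社 is a software company that operates "Next Engine" (ネクストエンジン), an integrated e-commerce management SaaS with the leading share of the e-commerce market. It currently supports the growth of more than 6,500 e-commerce businesses and was listed on the Tokyo Stock Exchange Growth Market in November 2025. Next Engine ...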
Tech_SRE(Site Reliability Engineer/Contract)

2 months ago

Toranomon, Minato-ku, Tokyo, 株式会社TERASS ￥2,000,000 - ￥2,800,000 per year

TERASS · SRE (Site Reliability Engineer) · ...
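[Contract] Site Reliability Engineer(SRE)

3 weeks ago

Nishi-Gotanda, Shinagawa-ku, Tokyo, 株式会社エライク Remote job

Job description: For the overseas eSIM app "trifa" (トリファ), you will be responsible for the SRE domain that underpins infrastructure, reliability, and availability. Because the SRE team is in its start-up phase, you will drive operational improvement, automation, and platform development hands-on. Main duties: infrastructure design and operations using GCP / AWS; improving and operating CI/CD pipelines; building out the monitoring and logging platform · Cloud infrastructure operations experience (3+ years); hands-on IaC experience; experience with CI/CD and incident response; experience designing with availability and security in mind · ...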
Customer Reliability Engineer

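1 month ago

Tokyo LY Corporation

For "LINE," you will apply deep domain knowledge and technical skill to the problems faced by internal and external customers of the Messaging Platform and Developer Product Platform, working with the Customer Support (CS) team and development teams on problem solving and the development of support tools. You will monitor platform quality against defined service level indicators (SLIs) and service level objectives (SLOs) and manage SLAs for customers. · Technical support, investigation, and problem resolution for day-to-day CS operations; data extraction for information disclosure requests; development of tools for information disclosure requests · ...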
