★Infrastructure Platform Engineer | Data Center Services
◆ English only OK
◆ Start-up Company
◆ Hybrid Work
◆ Own Products/Services
◆ Global Environment
◆ Annual salary: 8 million to 14 million yen
【About the company】
・Data Center Development Support
・Data Center Maintenance and Operations
・Development and Operation of Cloud Services for Data Centers
【Job Description】
About the Product
The company is building a high-performance AI compute platform in Japan. The platform must behave like a real cloud: predictable, observable, secure, and resilient, especially under multi-tenant load and fast-moving hardware/software stacks.
The company operates an environment where hardware, drivers, and serving runtimes evolve quickly. The company's infrastructure goal is not just "keep the cluster up," but to build a platform that is predictable and reliable in production, can be changed safely, and can be debugged quickly when something goes wrong.
Role overview
As an Infrastructure Engineer, you will own the GPU platform that runs production inference: cluster architecture, deployment reliability, observability, capacity management, and incident response mechanisms. Your job is to make the platform predictable and reliable—even as the company scales hardware, models, tenants, and traffic patterns.
You'll work closely with serving/runtime and gateway teams to ensure the platform enforces the right isolation, exposes the right telemetry, and supports safe changes without downtime. This role blends strong systems intuition with real production discipline: reliable rollouts, clean operational tooling, and fast incident response.
Responsibilities
●Own GPU cluster architecture and operations: provisioning, node images, driver/runtime lifecycle, GPU plugin/operator lifecycle, and standardized deployment patterns for serving pools and system services.
●Define and maintain the production baseline: golden node configurations, cluster hardening, upgrade paths, and "known good" compatibility matrices (drivers, CUDA, runtime, kernel).
●Build reliability into the platform: SLOs/SLIs, alerting quality, runbooks, incident tooling, and postmortems with real follow-through (automation, guardrails, and elimination of repeat incidents).
●Enable safe delivery: canary deploys, progressive rollouts, rollback paths, and configuration safety (validation, guardrails, change controls, and safe defaults).
●Own fleet health and maintenance workflows: node draining, GPU quarantining, automated remediation, scheduled maintenance, and safe "break-glass" procedures with auditability.
●Capacity and utilization: scheduling constraints, binpacking/fragmentation management, warm pools, autoscaling primitives, and quota enforcement hooks that align with product tiers and fairness goals.
●Observability: metrics/logs/tracing across gateway → serving → GPU; latency breakdowns, saturation signals, queue depth, GPU memory/compute metrics, and fleet health dashboards that help correlate customer symptoms to root causes.
●Production readiness for heterogeneous environments: manage differences across hardware generations and evolving server platforms, minimizing reliability risk while improving utilization.
●Security baseline: secrets management, least-privilege access, audit trails for operator actions, and secure operational workflows.
●Partner with networking: topology, failure domains, load balancing, and performance-sensitive traffic paths that impact tail latency and availability.
●Build operational tooling: fleet management, debugging workflows, safe admin actions, capacity tooling, and maintenance automation that reduces MTTR and improves operator efficiency.
●Collaborate across teams: align rollout plans, health semantics, capacity signals, and failure handling so the entire platform behaves predictably under load.
【Requirements】
Requirements
●5+ years in infrastructure/SRE/platform engineering for production distributed systems.
●Strong Kubernetes experience in production (or equivalent orchestration), with real ops ownership.
●Experience operating GPU clusters or other high-performance compute fleets (or similarly performance-sensitive infrastructure).
●Strong debugging skills across Linux, networking, and distributed systems failure modes.
●Strong operational discipline: automation-first mindset, measurable reliability, careful change management, clear communication during incidents.
●Willing to participate in an on-call rotation for owned systems.
Nice to have
●Experience with high-throughput gateways/service meshes (e.g., Envoy), OpenTelemetry, and multi-region architectures.
●Experience with Slurm/HPC-style scheduling, RDMA/IB, or performance-sensitive networking.
●Experience building internal developer platforms and "golden paths" for consistent deploy/rollback workflows.
●Experience managing GPU driver/runtime upgrades safely across a fleet (compatibility testing + staged rollouts).
●Familiarity with observability patterns for latency-sensitive systems (request correlation, sampling strategy, high-cardinality metrics control).
【Working Time】
09:00 ~ 18:00
【Welfare】
・Transportation Allowance: Partially provided (up to 15,000 yen per month).
・Social Insurance: Health insurance, employees' pension insurance, employment insurance, and industrial accident compensation insurance.
・Overtime Allowance: Standard overtime pay provided.
【Holiday】
・Annual Holidays: 120 days.
・Work System: Full five-day workweek system.
・Annual Paid Leave: At least 10 days, provided from the 7th month of employment.