JPMorganChase
Bengaluru, Karnataka, India
(on-site)
Posted
1 day ago
Job Type
Full-Time
Job Function
Banking
Site Reliability Engineer III
Description
There's nothing more exciting than being at the center of a rapidly growing field in technology and applying your skillsets to drive innovation and modernize the world's most complex and mission-critical systems. As a Site Reliability Engineer III at JPMorgan Chase within Asset & Wealth Management, you will solve complex and broad business problems with simple and straightforward solutions. Through code and cloud infrastructure, you will configure, maintain, monitor, and optimize applications and their associated infrastructure to independently decompose and iteratively improve on existing solutions. You are a significant contributor to your team by sharing your knowledge of end-to-end operations, availability, reliability, and scalability of your application or platform.
Job responsibilities
- Develops and refines Service Level Objectives (including metrics such as accuracy, fairness, latency, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token)) for large language model serving and training systems, balancing availability/latency with development velocity
- Designs, implements, and continuously improves monitoring systems covering availability, latency, and other salient metrics
- Collaborates in the design and implementation of high-availability language model serving infrastructure capable of handling the needs of high-traffic internal workloads
- Champions site reliability culture and practices, providing technical leadership and influence across teams to foster a culture of reliability and resilience
- Develops and manages automated failover and recovery systems for model serving deployments across multiple regions and cloud providers
- Develops AI Incident Response playbooks for AI-specific failures like sudden drift or bias spikes, including automated rollbacks and AI circuit breakers.
- Leads incident response for critical AI services, ensuring rapid recovery and systematic improvements from each incident
- Builds and maintains cost optimization systems for large-scale AI infrastructure, ensuring efficient resource utilization without compromising performance.
- Engineers for scale and security, leveraging techniques like load balancing, caching, optimized GPU scheduling, and AI Gateways for managing traffic and security.
- Collaborates with ML engineers to ensure seamless integration and operation of AI infrastructure, bridging the gap between development and operations.
- Implements continuous evaluation, including pre-deployment, pre-release, and continuous post-deployment monitoring for drift and degradation.
Required qualifications, capabilities, and skills
- Formal training or certification on software engineering concepts and 3+ years applied experience
- Demonstrated proficiency in reliability, scalability, performance, security, enterprise system architecture, toil reduction, and other site reliability best practices
- Proficient knowledge and experience in observability such as white and black box monitoring, service level objective alerting, and telemetry collection using tools such as Grafana, Dynatrace, Prometheus, Datadog, Splunk, and others
- Proficiency with continuous integration and continuous delivery tools such as Jenkins, GitLab, or Terraform, and with containers and container orchestration (ECS, Kubernetes, Docker)
- Experience with troubleshooting common networking technologies and issues
- Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines
- Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services
- Comfort working with both traditional metrics (latency, availability) and AI-specific metrics (model performance, training convergence)
- Ability to bridge the gap between ML engineers and infrastructure teams effectively
Preferred qualifications, capabilities, and skills
- Experience with AI-specific observability tools and platforms, such as OpenTelemetry and OpenInference.
- Familiarity with AI incident response strategies, including automated rollbacks and AI circuit breakers.
- Knowledge of AI-centric SLOs/SLAs, including metrics like accuracy, fairness, drift targets, TTFT (Time To First Token), and TPOT (Time Per Output Token).
- Expertise in engineering for scale and security, including load balancing, caching, optimized GPU scheduling, and AI Gateways.
- Experience with continuous evaluation processes, including pre-deployment, pre-release, and post-deployment monitoring for drift and degradation.
Job ID: 82100283
