Role mission
The primary mission of this role is to ensure the high availability and stability of enterprise applications running on IBM WebSphere and Red Hat OpenShift.
This is a support-centric position focused on proactive monitoring, rapid incident resolution (L3), and the continuous optimization of production environments to meet strict service level agreements (SLAs).
Key responsibilities
Incident & problem management (L3): Act as the final point of technical escalation for complex outages involving WebSphere Application Server (WAS) and OpenShift clusters.
Production stability: Monitor environment health 24/7 using enterprise observability tools and execute immediate recovery actions during critical failures.
Automation & scripting: Develop and maintain Bash and Python scripts that automate repetitive support tasks, log collection, and health checks across the platform.
Root cause analysis (RCA): Lead deep-dive investigations into JVM memory leaks, thread contention, and pod crashes to provide permanent fixes.
Patching & lifecycle: Execute platform upgrades, security patching, and configuration synchronization for both WAS (Base/ND) and OpenShift environments.
Observability: Configure and maintain dashboards (Grafana, Prometheus, or APM tools) to track cluster performance and application health.
On-call rotation: Participate in technical coverage for business-critical applications during high-priority incidents.
Technical stack & requirements
Must-have (technical core):
Container support: 3+ years troubleshooting Red Hat OpenShift (v4.x), including SDN, ingress/routes, and persistent volumes.
Middleware administration: Expert knowledge of IBM WebSphere (Base, ND, Liberty), including profile management, SSL/TLS certificates, and IHS.
Advanced scripting: Proven ability to create production-grade scripts in Bash or Python to interface with the OpenShift API (oc CLI) and automate middleware tasks.
Linux systems: Deep knowledge of RHEL (Red Hat Enterprise Linux) kernel parameters, networking diagnostics (tcpdump), and system performance tools.
Observability tools: Experience with monitoring stacks such as ELK, Dynatrace, AppDynamics, or Datadog.
Nice-to-have (added value):
Experience with Ansible for configuration management and automated patching.
Familiarity with ITIL frameworks (incident, problem, and change management).
Knowledge of F5 BIG-IP or similar enterprise load balancers.
Soft skills & competencies
Sense of urgency: Ability to remain effective and lead "war rooms" during high-priority production outages.
Analytical thinking: Methodical approach to isolating issues within complex, multi-layered architectures.
Technical communication: Capacity to translate complex infrastructure events into clear status updates for management.