Senior Research Engineer - Training Efficiency
Company: Luma AI
Location: Palo Alto
Posted on: March 15, 2025
Job Description:
Luma's mission is to build multimodal AI to expand human
imagination and capabilities. We believe that multimodality is
critical for intelligence. To go beyond language models and build
more aware, capable and useful systems, the next step-function
change will come from vision. So, we are working on training and
scaling up multimodal foundation models for systems that can see
and understand, show and explain, and eventually interact with our
world to effect change. We are looking for engineers with
significant experience solving hard problems in PyTorch, CUDA and
distributed systems. You will work alongside the rest of the
research team to build & train cutting-edge foundation models on
thousands of GPUs that are built to scale from the ground
up.
Responsibilities
- Ensure efficient implementation of models & systems with a
focus on large-scale training.
- Identify and implement optimization techniques for massively
parallel and distributed systems, including the underlying
communication layer.
- Identify and remedy efficiency bottlenecks (memory, speed,
utilization, communication) by profiling and implementing
high-performance PyTorch code, dropping down to Triton, CUDA, and
lower levels as necessary.
- Work closely with the rest of the research team to
ensure systems are planned to be as efficient as possible from
start to finish.
- Conduct research & experiments on state-of-the-art large-scale
generative AI models with the goal of improving latency & throughput
for training and inference.
Must have experience
- Experience training large models using Python & PyTorch,
including practical experience working with the full development
pipeline from data processing, preparation & dataloading to
training and inference.
- Experience profiling GPU & CPU code in PyTorch for optimal
device utilization (examples: torch profiler, NVIDIA Nsight
Systems/Compute, memory profilers, trace viewers, custom
tooling).
- Experience writing & improving highly parallel & distributed
PyTorch code for large generative models, with familiarity with FSDP,
Tensor Parallel, Sequence/Context Parallel, Pipeline Parallel,
etc.
- Experience working with transformer models and attention
implementations.
Good to have experience
- Experience with high-performance Triton/CUDA and writing custom
PyTorch kernels and ops. Top candidates will be able to write fused
kernels for common hot paths, understand when to make use of lower
level features like tensor cores or warp intrinsics, and will
understand where these tools can be most impactful.
- Experience writing high-performance parallel C++. Bonus if done
within an ML context with PyTorch, such as for data loading, data
processing, or inference code.
- Experience building inference / demo prototype code (incl.
Gradio, Docker etc.).