SaaS product

Load your model on our prediction SaaS page and get a prediction in seconds.

API

Use our APIs to gain programmatic access.

Optimization

Use our technology to optimize your models.

From our blog

Read about our technologies here.

Fully Sharded Data Parallel (FSDP)

By Karthik Panner Selvam on February 12, 2025

Introduction

Training large language models on a multi-node, multi-GPU system can be a challenging task. Fully Sharded Data Parallel (FSDP) is an approach designed to improve training efficiency by optimizing memory usage and scalability. It does so by sharding, or partitioning, the model’s parameters across all available GPUs in a distributed computing environment. Each GPU holds only a fraction of the total model parameters, which significantly reduces memory requirements compared to traditional data-parallel training, where every GPU keeps a full copy of the model. In this blog post, I will explain the mechanism behind FSDP and walk through a practical implementation of fine-tuning the Llama-3 8B model using FSDP, 4-bit quantization, and PEFT.
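
As a taste of what the full post covers, below is a minimal sketch of wrapping a model with PyTorch’s FullyShardedDataParallel. The toy model, dimensions, and torchrun launch are illustrative assumptions, not the post’s actual Llama-3 fine-tuning code.

```python
import torch
import torch.nn as nn
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP


def main():
    # Assumes a launch such as `torchrun --nproc_per_node=<num_gpus> train.py`,
    # which sets RANK / WORLD_SIZE / LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = dist.get_rank() % torch.cuda.device_count()
    torch.cuda.set_device(local_rank)

    # Placeholder model; a real run would load e.g. Llama-3 8B here instead.
    model = nn.Sequential(
        nn.Linear(4096, 4096),
        nn.GELU(),
        nn.Linear(4096, 4096),
    ).cuda()

    # FSDP shards parameters, gradients, and optimizer state across ranks,
    # so each GPU materializes full layers only transiently during the
    # forward and backward passes.
    model = FSDP(model)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # One dummy training step to show the shape of the loop.
    x = torch.randn(8, 4096, device="cuda")
    loss = model(x).pow(2).mean()
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```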

Continue reading

Rotary Position Embedding (RoPE)

By Karthik Panner Selvam on February 12, 2025

Introduction

Positional embedding is a crucial part of transformer models such as BERT and GPT. Unlike traditional models such as RNNs or LSTMs, which learn the order of input sequences through sequential processing, transformers treat input sequences as unordered sets. This improves computational efficiency and overall performance, but it does not account for the natural order of tokens, which is essential for understanding text. The limitation arises because the transformer architecture relies on self-attention, which is permutation-invariant: it treats all positions equally, regardless of how the elements are arranged in the sequence.

Positional embeddings address this shortcoming by injecting the sequence order into the model’s inputs, enabling the model to remain aware of token positions. This is essential for language tasks such as translation, generation, and comprehension, where changing the word order can drastically change the meaning of a sentence. For example, “The cat sat on the mat” and “The mat sat on the cat” contain exactly the same words but mean different things because of word order.
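
As a preview of the mechanism the post develops, here is a minimal sketch of the rotation RoPE applies to query and key vectors, using the half-split channel pairing; the tensor shapes and random values are illustrative only.

```python
import torch


def rotary_embedding(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate channel pairs of x by a position-dependent angle.

    x has shape (seq_len, dim) with dim even. Channel i is paired with
    channel i + dim/2, and pair i at position p is rotated by the angle
    p * base**(-2i/dim).
    """
    seq_len, dim = x.shape
    half = dim // 2
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)         # (seq_len, 1)
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)  # (half,)
    angles = pos * inv_freq                                               # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()

    x1, x2 = x[:, :half], x[:, half:]
    # 2-D rotation of each (x1, x2) pair by its angle.
    return torch.cat((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)


# Because both queries and keys are rotated, the dot product q . k ends up
# depending on the relative offset between positions rather than on
# absolute positions.
q = rotary_embedding(torch.randn(6, 64))
k = rotary_embedding(torch.randn(6, 64))
scores = q @ k.T  # (6, 6) attention logits before softmax
```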

Continue reading

Graph Neural Network: In a Nutshell

By Karthik Panner Selvam on February 12, 2025

Graph Neural Networks (GNNs) are deep learning models specifically designed to identify and interpret complex patterns and dependencies in data structured as a graph. Unlike traditional neural networks, which assume data instances are independent and identically distributed, GNNs exploit the connections between data points, making them particularly well suited to relational data represented as graphs.

General Mathematics of GNNs

A GNN operates on a graph denoted \(G = (V, E)\), where \(V\) is the set of nodes and \(E\) is the set of edges connecting them. Each node \(v \in V\) is associated with a feature vector \(X_v\), which encapsulates the attributes or properties of that node.
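
To make this concrete, the sketch below implements one simple message-passing layer on such a graph; the adjacency matrix, feature dimensions, and weight shapes are illustrative assumptions rather than anything from the post itself.

```python
import torch


def gnn_layer(X: torch.Tensor, A: torch.Tensor, W: torch.Tensor) -> torch.Tensor:
    """One message-passing layer on a graph G = (V, E).

    X: (|V|, d)    node feature vectors X_v
    A: (|V|, |V|)  adjacency matrix (A[u, v] = 1 if there is an edge u-v)
    W: (d, d_out)  learnable weight matrix

    Each node mean-pools its neighbours' features, applies a linear map,
    and then a nonlinearity to produce its updated representation.
    """
    deg = A.sum(dim=1, keepdim=True).clamp(min=1)  # node degrees, avoid division by zero
    agg = (A @ X) / deg                            # mean over each node's neighbours
    return torch.relu(agg @ W)


# Toy undirected graph with 4 nodes, 3-dimensional features, 8 output channels.
A = torch.tensor([[0, 1, 1, 0],
                  [1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=torch.float32)
X = torch.randn(4, 3)
W = torch.randn(3, 8)
H = gnn_layer(X, A, W)  # (4, 8) updated node representations
```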

Continue reading

Our Supporters