Jass — Reinforcement Learning

2026 — ongoing · Side project
Reinforcement Learning · PPO · DQN · Rainbow DQN · Monte Carlo Tree Search · Python · PyTorch

What is Jass?

Jass is the most popular card game in Switzerland, played with a 36-card deck. It is a trick-taking game for four players split into two teams of two. Each round, one player selects a trump suit that outranks all others, and players take turns playing a card to win tricks. Points are scored based on which cards are captured, with the total per round fixed at 157. The team that reaches a target score first wins the match.
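The trick-taking mechanic above can be sketched in a few lines. This is an illustrative simplification with hypothetical names (not code from the project): it uses a plain rank order for all suits, whereas real Jass promotes the Jack and the 9 of the trump suit to the top.

```python
# Simplified trick resolution: trump beats everything, otherwise the lead
# suit decides. (Real Jass ranks the trump Jack and 9 highest -- omitted here.)
RANK_ORDER = ["6", "7", "8", "9", "10", "J", "Q", "K", "A"]

def trick_winner(trick, trump):
    """Return the index of the winning card in a 4-card trick.

    trick: list of (rank, suit) tuples in play order; trump: the trump suit.
    """
    lead_suit = trick[0][1]

    def strength(card):
        rank, suit = card
        if suit == trump:
            return (2, RANK_ORDER.index(rank))  # any trump beats non-trumps
        if suit == lead_suit:
            return (1, RANK_ORDER.index(rank))  # lead suit can win the trick
        return (0, 0)                           # off-suit cards never win

    return max(range(4), key=lambda i: strength(trick[i]))
```

For example, if hearts are led and spades are trump, even the 6 of spades beats the Ace of hearts.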

The game requires both short-term tactical decisions — which card to play to win or lose a trick on purpose — and longer-term strategic coordination with a partner who sits across the table and whose hand you cannot see.

Why Jass is a compelling RL problem

Games like chess or Go have been landmark environments for RL research, but they share a property that makes them fundamentally simpler from an information standpoint: perfect information. Every player sees the full game state at all times. This means that in principle, search algorithms like MCTS or minimax can evaluate any position exactly — they just need enough compute.

Jass is a hidden-information game. Each player holds 9 cards that no one else can see. The hands of your partner and both opponents are unknown throughout the game. This fundamentally changes the problem: the agent cannot reason about a single game tree — it must reason under uncertainty about a distribution of possible worlds consistent with the cards it has observed so far. Standard tree search either becomes intractable or requires explicit handling of stochasticity and partial observability.

On top of hidden information, Jass introduces a cooperative-competitive structure: the agent must coordinate implicitly with its partner (no communication allowed) while competing against the opposing team. This mix of collaboration and adversarial play, combined with a delayed and sparse reward signal (points are only counted at the end of each round), makes it a rich benchmark for modern RL methods.

The core difficulty: an agent cannot know what its partner is holding or what the opponents are planning. Every decision must be made under uncertainty — both about the current state and about the long-term intentions of the other three players.

Learning context

This project is driven by personal interest in RL and a desire to apply what I have been learning systematically. The theoretical foundations come from RL courses taken at EPFL, and the practical side has been reinforced by following the Velocity Labs RL course on Udemy — working through algorithm implementations from first principles before applying them to Jass.

Agents implemented

Random agent — plays a uniformly random valid card at each step. Serves as the lower bound: any agent that cannot beat this consistently is not learning anything useful.

Rule-based agent — follows a set of handcrafted heuristics encoding basic Jass strategy: play the highest trump when leading, avoid wasting high-value cards on already-won tricks, follow suit when forced. This is a strong baseline that captures the kind of common-sense knowledge a beginner human player would apply.

PIMC — Perfect Information Monte Carlo — a search-based agent designed specifically for hidden-information games. At each decision point, PIMC samples a large number of possible card distributions for the unseen hands (consistent with the cards played so far), solves each sampled world as a perfect-information game using a tree search, and aggregates the results to select the action that performs best on average across all sampled worlds. This is sometimes called "determinisation" — the agent temporarily assumes it can see all cards, searches that simpler problem, then repeats over many samples to account for uncertainty. PIMC produces strong play without any learning, but its quality depends on the number of simulations it can run per move.
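The determinisation loop at the heart of PIMC can be sketched as follows. This is a schematic, not the project's implementation: `sample_world` and `solve_world` are hypothetical callables standing in for the consistent-deal sampler and the perfect-information solver (e.g. minimax or MCTS).

```python
import random

def pimc_choose(legal_actions, sample_world, solve_world, n_worlds=100, rng=None):
    """Determinisation: average each action's exact value over sampled worlds.

    sample_world(rng) draws a full deal consistent with observations so far;
    solve_world(world, action) returns the action's value in that
    perfect-information world.
    """
    rng = rng or random.Random(0)
    totals = {a: 0.0 for a in legal_actions}
    for _ in range(n_worlds):
        world = sample_world(rng)          # one possible assignment of unseen cards
        for a in legal_actions:
            totals[a] += solve_world(world, a)
    # pick the action that performs best on average across sampled worlds
    return max(totals, key=totals.get)
```

A toy usage: if the hidden world is a uniform draw in [0, 1], a "risky" action worth the world's value beats a "safe" action worth a constant 0.4 on average, and PIMC recovers that.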

DQN with Prioritized Experience Replay (PER) — a Deep Q-Network that learns a value function mapping (state, action) pairs to expected future reward. The key addition over vanilla DQN is Prioritized Experience Replay: instead of sampling transitions uniformly from the replay buffer, transitions are sampled with probability proportional to their temporal-difference error — the bigger the surprise, the more frequently the transition is revisited for learning. This focuses training on the most informative experiences and accelerates convergence. The priority queue is implemented as a sum tree, a binary tree data structure where each leaf stores a transition's priority and each internal node stores the sum of its children — enabling O(log n) sampling and update operations rather than O(n) linear scans over the buffer.
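The sum tree described above can be sketched compactly. This is a minimal illustration (assuming a power-of-two capacity for the flat array layout), not the project's actual buffer code:

```python
class SumTree:
    """Binary sum tree for proportional prioritized sampling.

    Leaves hold per-transition priorities; each internal node stores the sum
    of its children, so both sampling and updates are O(log n).
    Assumes `capacity` is a power of two for the flat 1-based layout.
    """
    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0.0] * (2 * capacity)  # leaves live at [capacity, 2*capacity)

    def update(self, idx, priority):
        i = idx + self.capacity
        self.tree[i] = priority
        i //= 2
        while i >= 1:                       # propagate the new sum up to the root
            self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
            i //= 2

    def sample(self, value):
        """Return the leaf index whose cumulative-priority interval contains value."""
        i = 1
        while i < self.capacity:            # descend from the root to a leaf
            left = 2 * i
            if value <= self.tree[left]:
                i = left                    # target lies in the left subtree
            else:
                value -= self.tree[left]    # skip past the left subtree's mass
                i = left + 1
        return i - self.capacity

    @property
    def total(self):
        return self.tree[1]                 # root holds the sum of all priorities
```

Sampling a uniform value in [0, total) then lands on each transition with probability proportional to its priority.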

PPO — Proximal Policy Optimisation — a policy gradient method that directly learns a stochastic policy rather than a value function. PPO's key contribution is a clipped surrogate objective that prevents the policy update from moving too far from the current policy in a single step — avoiding the catastrophic policy collapses that plagued earlier policy gradient methods. Unlike DQN, PPO operates on-policy: it collects a batch of experience with the current policy, updates, then discards that experience. This makes it less sample-efficient than DQN in theory, but its stability and its natural fit for continuous and stochastic action distributions make it a strong choice for complex environments.
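The clipped surrogate objective is simple enough to write out directly. The sketch below shows just the math in pure Python; real training code would use PyTorch tensors so autograd can differentiate through it.

```python
import math

def ppo_clip_loss(logp_new, logp_old, advantages, eps=0.2):
    """PPO's clipped surrogate loss over a batch (pure-Python illustration).

    L = -mean(min(r * A, clip(r, 1 - eps, 1 + eps) * A)),
    where r = exp(logp_new - logp_old) is the probability ratio between
    the updated policy and the policy that collected the data.
    """
    losses = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)
        clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
        # take the pessimistic (smaller) surrogate; negate for gradient descent
        losses.append(-min(ratio * adv, clipped * adv))
    return sum(losses) / len(losses)
```

The clipping means that once the ratio leaves [1 − eps, 1 + eps] in the direction the advantage favours, the objective stops rewarding further movement, which is what keeps a single update from dragging the policy too far.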

Tournament results

Each agent pair was evaluated over 100–1000 games (depending on computational cost). The table reports win rate and average point differential per game from the perspective of the first-named agent. A total of 157 points are distributed per round; a positive score diff indicates the first agent is consistently winning more than half the points.

Matchup                | Win rate | Avg score diff | Games
PPO vs Random          | 84.1%    | +67.1          | 1000
PPO vs Rule-based      | 63.4%    | +29.1          | 1000
PPO vs DQN             | 72.6%    | +49.7          | 1000
PPO vs PIMC            | 50.0%    | +10.4          | 100
PIMC vs Random         | 78.0%    | +59.0          | 100
PIMC vs Rule-based     | 61.0%    | +16.3          | 100
Rule-based vs Random   | 74.5%    | +45.5          | 1000
DQN vs Random          | 59.3%    | +17.7          | 1000
DQN vs Rule-based      | 39.7%    | −31.7          | 1000

The overall hierarchy that emerges is: PPO ≈ PIMC > Rule-based > DQN > Random. PPO is the strongest learned agent — it comfortably beats the rule-based agent and DQN, and matches PIMC head-to-head despite PIMC having access to explicit search over simulated game worlds. DQN beats the random baseline but struggles against the rule-based agent, suggesting it has learned basic card play but not yet the strategic depth that heuristic knowledge encodes. The 50% tie between PPO and PIMC is notable: a purely learned policy, with no lookahead or simulation at inference time, reaches the same level as a search-based agent that explicitly reasons about uncertainty.

What's next

This project is ongoing. The next planned steps are:

  • Rainbow DQN — combining the key improvements to DQN (PER, dueling networks, multi-step returns, distributional RL, noisy nets) into a single agent to measure how much each component contributes above the current PER-DQN baseline.
  • Improved MCTS — exploring a more principled tree search under imperfect information, moving beyond the determinisation approach of PIMC toward methods that reason directly over belief states.
  • AlphaZero-style training — combining MCTS-guided self-play with a learned policy-value network, following the approach that produced superhuman performance in chess and Go, adapted to the hidden-information setting of Jass.

The goal is not only to improve performance but to understand what each algorithmic family contributes and where the hidden-information structure of Jass creates hard limits for each approach.