PPO Agent playing Huggy
This is a trained model of a PPO agent playing Huggy using the Unity ML-Agents Library.
Huggy PPO Agent - Training Documentation
Model Overview
Huggy is a PPO (Proximal Policy Optimization) agent trained with the Unity ML-Agents toolkit in a custom Unity environment, where it learns its target behavior over 2 million training steps.
Training Environment
- Environment: Unity ML-Agents custom environment "Huggy"
- ML-Agents Version: 1.2.0.dev0
- ML-Agents Envs: 1.2.0.dev0
- Communicator API: 1.5.0
- PyTorch Version: 2.7.1+cu126
- Unity Package Version: 2.2.1-exp.1
Training Configuration
PPO Hyperparameters
- Batch Size: 2,048
- Buffer Size: 20,480
- Learning Rate: 0.0003 (linear schedule)
- Beta (entropy regularization): 0.005 (linear schedule)
- Epsilon (PPO clip parameter): 0.2 (linear schedule)
- Lambda (GAE parameter): 0.95
- Number of Epochs: 3
- Shared Critic: False
Network Architecture
- Normalization: Enabled
- Hidden Units: 512
- Number of Layers: 3
- Visual Encoding Type: Simple
- Memory: None
- Goal Conditioning Type: Hyper
- Deterministic: False
Reward Configuration
- Reward Type: Extrinsic
- Gamma (discount factor): 0.995
- Reward Strength: 1.0
- Reward Network Hidden Units: 128
- Reward Network Layers: 2
Training Parameters
- Maximum Steps: 2,000,000
- Time Horizon: 1,000
- Summary Frequency: 50,000 steps
- Checkpoint Interval: 200,000 steps
- Keep Checkpoints: 15
- Threaded Training: False
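The settings above map directly onto the fields of an ML-Agents trainer configuration. As a reference, here is a hedged sketch of what the config.yaml for this run plausibly looked like, reconstructed from the values listed above (the exact file used for training is not included here, so treat the field grouping and any omitted defaults as assumptions):

behaviors:
  Huggy:
    trainer_type: ppo
    hyperparameters:
      batch_size: 2048
      buffer_size: 20480
      learning_rate: 0.0003
      learning_rate_schedule: linear
      beta: 0.005
      beta_schedule: linear
      epsilon: 0.2
      epsilon_schedule: linear
      lambd: 0.95
      num_epoch: 3
      shared_critic: false
    network_settings:
      normalize: true
      hidden_units: 512
      num_layers: 3
      vis_encode_type: simple
    reward_signals:
      extrinsic:
        gamma: 0.995
        strength: 1.0
        network_settings:
          hidden_units: 128
          num_layers: 2
    keep_checkpoints: 15
    checkpoint_interval: 200000
    max_steps: 2000000
    time_horizon: 1000
    summary_freq: 50000
    threaded: false

Training is then typically launched with mlagents-learn config.yaml --run-id=Huggy2 (the run ID is inferred from the results/Huggy2 path used later in this guide).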
Training Performance
Performance Progression
The agent showed steady improvement throughout training:
Early Training (0-200k steps):
- Step 50k: Mean Reward = 1.840 ± 0.925
- Step 100k: Mean Reward = 2.747 ± 1.096
- Step 150k: Mean Reward = 3.031 ± 1.174
- Step 200k: Mean Reward = 3.538 ± 1.370
Mid Training (200k-1M steps):
- Performance stabilized around 3.6-3.9 mean reward
- Peak performance at 500k steps: 3.873 ± 1.783
Late Training (1M-2M steps):
- Consistent performance around 3.5-3.8 mean reward
- Final performance at 2M steps: 3.718 ± 2.132
Key Performance Metrics
- Training Duration: 2,350.439 seconds (~39 minutes)
- Final Mean Reward: 3.718
- Final Standard Deviation: 2.132
- Peak Mean Reward: 3.873 (at 500k steps)
- Lowest Standard Deviation: 0.925 (at 50k steps)
Training Characteristics
Learning Curve Analysis
- Rapid Initial Learning: Significant improvement in first 200k steps (1.84 → 3.54)
- Plateau Phase: Performance stabilized between 200k-2M steps
- Variance Increase: Standard deviation increased over time, indicating more diverse behavior patterns
Model Checkpoints
Regular ONNX model exports were created approximately every 200k steps:
- Huggy-199933.onnx
- Huggy-399938.onnx
- Huggy-599920.onnx
- Huggy-799966.onnx
- Huggy-999748.onnx
- Huggy-1199265.onnx
- Huggy-1399932.onnx
- Huggy-1599985.onnx
- Huggy-1799997.onnx
- Huggy-1999614.onnx
- Final Model: Huggy-2000364.onnx
Technical Implementation
Training Framework
- Unity ML-Agents with PPO algorithm
- Custom Unity environment integration
- ONNX model export for deployment
- Real-time training monitoring
Model Architecture Details
- Multi-layer perceptron with 3 hidden layers
- 512 hidden units per layer
- Input normalization enabled
- Separate actor-critic networks (shared_critic = False)
- Hypernetwork goal conditioning
Reward Signal Processing
- Single extrinsic reward signal
- Discount factor of 0.995 for long-term planning
- Dedicated reward network with 2 layers and 128 units
Performance Insights
Strengths
- Consistent learning progression
- Stable final performance around 3.7 mean reward
- Successful completion of 2M training steps
- Regular checkpoint generation for model versioning
Observations
- Standard deviation increased over training, suggesting the agent learned more diverse strategies
- Performance plateaued after roughly 200k steps, suggesting the policy had largely converged well before the 2M-step budget was exhausted
- The agent maintained stable performance without significant degradation
Training Efficiency
- Steps per Second: ~851 steps/second on average (see the quick check below)
- Episodes per Checkpoint: approximately 200-250
- Memory Usage: bounded by the 20,480-sample buffer and the 1,000-step time horizon
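As a quick sanity check, the throughput figure follows directly from the totals reported above:

# Reported totals from this training run
total_steps = 2_000_000        # Maximum Steps
duration_seconds = 2350.439    # Training Duration
print(f"{total_steps / duration_seconds:.0f} steps/second")  # ≈ 851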
This training session demonstrates successful PPO implementation in a Unity environment with consistent performance and robust learning characteristics.
Huggy PPO Agent - Usage Guide
Prerequisites
Before using the Huggy model, ensure you have the following installed:
# Install Unity ML-Agents (ideally matching the version used for training, 1.2.0.dev0 in this run)
pip install mlagents==1.2.0
# Install required dependencies
pip install torch==2.7.1
pip install onnx
pip install onnxruntime
Model Files
After training, you'll have these key files:
- Huggy.onnx - The trained model (final version)
- Huggy-2000364.onnx - Final checkpoint model
- config.yaml - Training configuration file
- training logs - Performance metrics and TensorBoard data
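Before writing inference code against these files, it helps to inspect the exported ONNX graph: input and output tensor names vary between ML-Agents export versions, so the names assumed in the snippets below (such as obs_0 and a single action output) should be verified first. A minimal sketch, assuming the results/Huggy2 layout used later in this guide:

import onnxruntime as ort

# List the model's input and output tensors to confirm names and shapes
session = ort.InferenceSession("results/Huggy2/Huggy.onnx")
for inp in session.get_inputs():
    print("input :", inp.name, inp.shape, inp.type)
for out in session.get_outputs():
    print("output:", out.name, out.shape, out.type)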
Loading and Using the Model
Method 1: Using ML-Agents Python API
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import numpy as np
# Load the Unity environment
env = UnityEnvironment(file_name="path/to/your/huggy_environment")
# Reset the environment
env.reset()
# Get behavior specs
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0] # "Huggy"
spec = env.behavior_specs[behavior_name]
print(f"Observation space: {spec.observation_specs}")
print(f"Action space: {spec.action_spec}")
Method 2: Using ONNX Runtime for Inference
import onnxruntime as ort
import numpy as np
# Load the trained ONNX model
model_path = "results/Huggy2/Huggy.onnx"
ort_session = ort.InferenceSession(model_path)
# Get model input/output info (ML-Agents exports typically expose several outputs;
# inspect get_outputs() and pick the action-related tensor for your model)
input_name = ort_session.get_inputs()[0].name
output_name = ort_session.get_outputs()[0].name
def predict_action(observation):
    """
    Predict an action for a single observation using the trained model.
    """
    # Prepare observation: add a batch dimension and cast to float32
    # (normalization is baked into the exported model when normalize=true)
    obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)

    # Run inference
    outputs = ort_session.run([output_name], {input_name: obs_input})
    action_values = np.asarray(outputs[0]).flatten()

    # Deterministic action (argmax); for a stochastic policy, sample instead,
    # e.g. np.random.choice(len(action_values), p=action_values) if the output
    # is a probability distribution
    action = int(np.argmax(action_values))
    return action
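A short usage sketch for the function above, using a placeholder observation; this assumes the batch dimension is the only dynamic dimension of the model's first input:

# Example call with a zero-filled observation of the right size
obs_dim = ort_session.get_inputs()[0].shape[-1]   # last dimension = observation size
dummy_obs = np.zeros(obs_dim, dtype=np.float32)
print("Predicted action:", predict_action(dummy_obs))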
Method 3: Running Trained Agent in Unity
from mlagents_envs.environment import UnityEnvironment
from mlagents_envs.base_env import ActionTuple
import onnxruntime as ort
import numpy as np
# Initialize environment and model
env = UnityEnvironment(file_name="HuggyEnvironment")
ort_session = ort.InferenceSession("results/Huggy2/Huggy.onnx")
# Get behavior name
behavior_names = list(env.behavior_specs.keys())
behavior_name = behavior_names[0]
# Run episodes
for episode in range(10):
    env.reset()
    decision_steps, terminal_steps = env.get_steps(behavior_name)
    episode_reward = 0
    step_count = 0

    while len(decision_steps) > 0:
        # Get observations for all active agents
        observations = decision_steps.obs[0]

        # Predict one action per agent using the trained model
        # (the "obs_0" input name and a single discrete action branch are
        # assumptions; check the export's tensor names and the behavior's
        # action spec, and use ActionTuple(continuous=...) if appropriate)
        actions = []
        for obs in observations:
            outputs = ort_session.run(None, {"obs_0": obs.reshape(1, -1)})
            actions.append(int(np.argmax(outputs[0])))

        # Send actions to the environment: shape (num_agents, num_branches)
        action_tuple = ActionTuple(discrete=np.array(actions).reshape(len(actions), 1))
        env.set_actions(behavior_name, action_tuple)

        # Step environment
        env.step()
        decision_steps, terminal_steps = env.get_steps(behavior_name)

        # Track rewards
        if len(terminal_steps) > 0:
            episode_reward += terminal_steps.reward[0]
            break
        if len(decision_steps) > 0:
            episode_reward += decision_steps.reward[0]
        step_count += 1

    print(f"Episode {episode + 1}: Reward = {episode_reward:.3f}, Steps = {step_count}")

env.close()
Evaluation and Testing
Performance Evaluation Script
import numpy as np
from mlagents_envs.base_env import ActionTuple
def evaluate_model(env, model_session, num_episodes=100):
    """
    Evaluate the trained model's performance over several episodes.
    """
    results = {
        'rewards': [],
        'episode_lengths': [],
        'success_rate': 0  # populate if your environment exposes a success criterion
    }

    behavior_name = list(env.behavior_specs.keys())[0]

    for episode in range(num_episodes):
        env.reset()
        decision_steps, terminal_steps = env.get_steps(behavior_name)
        episode_reward = 0
        episode_length = 0

        while len(decision_steps) > 0:
            # Get actions from the model (the "obs_0" input name and a single
            # discrete action branch are assumptions; adjust to your export)
            observations = decision_steps.obs[0]
            actions = []
            for obs in observations:
                outputs = model_session.run(None, {"obs_0": obs.reshape(1, -1)})
                actions.append(int(np.argmax(outputs[0])))  # deterministic policy

            # Step the environment with one action per agent
            action_tuple = ActionTuple(discrete=np.array(actions).reshape(len(actions), 1))
            env.set_actions(behavior_name, action_tuple)
            env.step()
            decision_steps, terminal_steps = env.get_steps(behavior_name)
            episode_length += 1

            # Check for episode termination
            if len(terminal_steps) > 0:
                episode_reward = terminal_steps.reward[0]
                break

        results['rewards'].append(episode_reward)
        results['episode_lengths'].append(episode_length)

    # Calculate statistics
    mean_reward = np.mean(results['rewards'])
    std_reward = np.std(results['rewards'])
    mean_length = np.mean(results['episode_lengths'])

    print(f"Evaluation Results ({num_episodes} episodes):")
    print(f"Mean Reward: {mean_reward:.3f} ± {std_reward:.3f}")
    print(f"Mean Episode Length: {mean_length:.1f}")
    print(f"Min Reward: {np.min(results['rewards']):.3f}")
    print(f"Max Reward: {np.max(results['rewards']):.3f}")

    return results
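A usage sketch for the evaluation function; the environment and model paths are assumptions and should be adapted to your build and run folder:

from mlagents_envs.environment import UnityEnvironment
import onnxruntime as ort

env = UnityEnvironment(file_name="HuggyEnvironment", no_graphics=True)
env.reset()
session = ort.InferenceSession("results/Huggy2/Huggy.onnx")

results = evaluate_model(env, session, num_episodes=20)
env.close()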
Deployment Options
Option 1: Unity Standalone Build
- Build your Unity environment with the trained model
- The model will automatically use the ONNX file for inference
- Deploy as a standalone executable
Option 2: Python Integration
# For integration into larger Python applications
import numpy as np
import onnxruntime as ort

class HuggyAgent:
    def __init__(self, model_path):
        self.session = ort.InferenceSession(model_path)
        self.input_name = self.session.get_inputs()[0].name

    def act(self, observation):
        """Get a deterministic action (argmax over the model output)."""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        outputs = self.session.run(None, {self.input_name: obs_input})
        return int(np.argmax(outputs[0]))

    def act_stochastic(self, observation):
        """Sample an action, assuming the model output is a probability distribution."""
        obs_input = np.array(observation, dtype=np.float32).reshape(1, -1)
        action_probs = self.session.run(None, {self.input_name: obs_input})[0].flatten()
        return int(np.random.choice(len(action_probs), p=action_probs))
# Usage
agent = HuggyAgent("results/Huggy2/Huggy.onnx")
action = agent.act(current_observation)
Option 3: Web Deployment
# For web applications using Flask/FastAPI
from flask import Flask, request, jsonify
import onnxruntime as ort
import numpy as np

app = Flask(__name__)
model = ort.InferenceSession("Huggy.onnx")

@app.route('/predict', methods=['POST'])
def predict():
    data = request.json
    observation = np.array(data['observation'], dtype=np.float32)
    # "obs_0" is the typical observation input name in ML-Agents exports;
    # verify it with model.get_inputs() for your model
    outputs = model.run(None, {"obs_0": observation.reshape(1, -1)})
    action = int(np.argmax(outputs[0]))
    # "confidence" assumes the output is a probability distribution
    return jsonify({'action': action, 'confidence': float(np.max(outputs[0]))})

if __name__ == '__main__':
    app.run(debug=True)
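A hedged client-side example for the endpoint above, using the requests library; the observation vector here is a placeholder and must match the environment's real observation size:

import requests

obs_dim = 8  # placeholder only; use the actual observation size of your environment
response = requests.post(
    "http://127.0.0.1:5000/predict",   # Flask's default development address and port
    json={"observation": [0.0] * obs_dim},
)
print(response.json())  # e.g. {"action": ..., "confidence": ...}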
Troubleshooting
Common Issues
ONNX Model Loading Errors
- Ensure ONNX runtime version compatibility
- Check model file path and permissions
Unity Environment Connection
- Verify Unity environment executable path
- Check port availability (default: 5004)
Observation Shape Mismatches
- Ensure observation preprocessing matches training
- Check input normalization requirements
Performance Issues
- Use deterministic policy for consistent results
- Consider batch inference for multiple agents
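For connection problems specifically, it often helps to set the executable path, port, and worker id explicitly; a minimal sketch (the path and values are illustrative assumptions):

from mlagents_envs.environment import UnityEnvironment

env = UnityEnvironment(
    file_name="path/to/HuggyEnvironment",  # path to the built executable
    worker_id=0,       # increment to move to a different port if the default is taken
    base_port=5004,    # override if the default port is unavailable
    no_graphics=True,  # run headless for faster evaluation
    timeout_wait=60,   # seconds to wait for the environment to respond
)
env.reset()
env.close()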
Performance Optimization
# Batch processing for multiple agents
def batch_predict(model_session, observations):
    """Process multiple observations at once (the "obs_0" input name is an assumption)."""
    batch_obs = np.array(observations, dtype=np.float32)
    outputs = model_session.run(None, {"obs_0": batch_obs})
    actions = np.argmax(outputs[0], axis=1)
    return actions
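Example usage of the batched helper with placeholder observations (the observation size and model path are assumptions):

import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("results/Huggy2/Huggy.onnx")
observations = [np.zeros(8, dtype=np.float32) for _ in range(4)]  # 4 agents, placeholder size
print(batch_predict(session, observations))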
This guide provides comprehensive instructions for deploying and using your trained Huggy PPO agent in various scenarios, from simple testing to production deployment.