Running Qwen3.6 35B Locally with Ollama and VS Code Integration
Overview
Running large language models locally is becoming increasingly practical, even for developers without access to massive GPU clusters.
In this post, I walk through how to:
- Run Qwen3.6 35B (A3B, Q4_K_M quantized) locally using Ollama
- Integrate the model into VS Code
- Use it as a local coding assistant

This setup is especially useful for:

- Air-gapped environments
- Cost control (no API usage)
- Experimenting with local LLM workflows
Architecture Overview
VS Code
│
▼
Local LLM Extension / API Client
│
▼
Ollama Runtime
│
▼
Qwen3.6:35B-a3b-q4_K_M (local model)
Prerequisites
- A machine with sufficient RAM / VRAM (Q4_K_M quantization helps reduce memory usage)
- Ollama installed
- VS Code installed

Optional but recommended:

- GPU acceleration (CUDA / ROCm, depending on your setup)
Step 1: Install Ollama
Install Ollama from the official site:
curl -fsSL https://ollama.com/install.sh | sh
Verify installation:
ollama --version
Step 2: Pull Qwen3.6 Model
Pull the quantized model:
ollama pull qwen3.6:35b-a3b-q4_K_M
Notes:

- 35b → 35 billion parameters
- a3b → Mixture-of-Experts variant with roughly 3B active parameters per token
- q4_K_M → 4-bit quantization (balanced quality vs. memory)
Step 3: Run the Model
Start the model:
ollama run qwen3.6:35b-a3b-q4_K_M
You can now interact with it directly in the terminal.
Step 4: Expose Ollama API
Ollama runs a local API server by default:
http://localhost:11434
Test it:
curl http://localhost:11434/api/generate \
  -d '{
    "model": "qwen3.6:35b-a3b-q4_K_M",
    "prompt": "Explain microservices architecture"
  }'
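The same request can be sent programmatically. Below is a minimal Python sketch using only the standard library; the helper names are my own, not part of any Ollama client library. Setting "stream": false asks Ollama for a single JSON response instead of newline-delimited streaming chunks.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(model: str, prompt: str) -> dict:
    """Build the JSON body that Ollama's /api/generate endpoint expects."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(model: str, prompt: str) -> str:
    """POST the payload to the local Ollama server and return the text."""
    body = json.dumps(build_generate_request(model, prompt)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # Non-streaming responses carry the full output in "response".
        return json.loads(resp.read())["response"]


# Usage (requires a running Ollama server):
# print(generate("qwen3.6:35b-a3b-q4_K_M", "Explain microservices architecture"))
```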
Step 5: Add Local Model to VS Code
There are multiple ways to integrate with VS Code.
Option 1: Use an LLM Extension
Install an extension that supports custom endpoints, such as:
- Continue
- CodeGPT
- OpenAI-compatible clients
- Copilot
Configure the extension to point at Ollama's local endpoint (http://localhost:11434); each extension documents its own procedure for custom API endpoints.
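As one concrete example, the Continue extension registers local Ollama models in its config file. The snippet below is a sketch based on Continue's Ollama provider; exact field names can differ between Continue versions, so treat it as illustrative rather than definitive.

```json
{
  "models": [
    {
      "title": "Qwen3.6 35B (local)",
      "provider": "ollama",
      "model": "qwen3.6:35b-a3b-q4_K_M",
      "apiBase": "http://localhost:11434"
    }
  ]
}
```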
Option 2: OpenAI-Compatible Proxy (if required)
Some extensions expect the OpenAI API format. You can use a lightweight proxy or adapter to map /v1/chat/completions requests onto Ollama's /api/generate. (Recent Ollama versions also expose an OpenAI-compatible /v1/chat/completions endpoint natively, which may make a proxy unnecessary.)
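The core of such a proxy is the request/response translation. The Python sketch below shows only that mapping, with hypothetical function names and no HTTP server around it: an OpenAI-style list of role-tagged messages is flattened into the single prompt string /api/generate takes, and Ollama's reply is wrapped back into a minimal OpenAI-shaped response.

```python
def openai_to_ollama(openai_body: dict) -> dict:
    """Translate an OpenAI /v1/chat/completions request body into an
    Ollama /api/generate request body. The chat messages are flattened
    into one prompt string, since /api/generate takes a single prompt."""
    lines = [f'{msg["role"]}: {msg["content"]}'
             for msg in openai_body.get("messages", [])]
    return {
        "model": openai_body["model"],
        "prompt": "\n".join(lines),
        "stream": openai_body.get("stream", False),
    }


def ollama_to_openai(ollama_resp: dict) -> dict:
    """Wrap Ollama's {"response": ...} JSON into a minimal
    OpenAI-style chat completion reply."""
    return {
        "object": "chat.completion",
        "model": ollama_resp.get("model", ""),
        "choices": [{
            "index": 0,
            "message": {"role": "assistant",
                        "content": ollama_resp["response"]},
            "finish_reason": "stop",
        }],
    }
```

A real proxy would sit behind an HTTP server and also translate streaming chunks, which this sketch omits.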
Step 6: Using the Model in VS Code
Once configured, you can:
- Generate code
- Refactor functions
- Explain code
- Generate documentation
Example prompt:
Refactor this Java method to improve readability and performance.
Performance Considerations
Memory
- Q4_K_M significantly reduces memory footprint
- Still requires substantial RAM for 35B models
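A back-of-envelope estimate of the weight footprint makes this concrete. The calculation below ignores the KV cache and runtime overhead (which add several more GB), and assumes Q4_K_M averages roughly 4.5 bits per weight, a typical figure for that scheme:

```python
def weight_footprint_gib(params_billion: float, bits_per_weight: float) -> float:
    """Rough size of model weights: parameters * bits / 8, in GiB."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 2**30


# 35B parameters at ~4.5 bits/weight (Q4_K_M): roughly 18-19 GiB
q4 = weight_footprint_gib(35, 4.5)
# The same model at FP16: roughly 65 GiB
fp16 = weight_footprint_gib(35, 16)
```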
Speed
- CPU: usable but slow
- GPU: much faster (recommended)
Trade-offs
| Quantization | Quality | Memory | Speed |
|---|---|---|---|
| Q4_K_M | Medium | Low | Fast |
| Q5 / Q6 | Higher | Medium | Slower |
| FP16 | Best | High | GPU required |
When to Use This Setup
This setup is ideal if you:
- Work in restricted enterprise environments
- Want to avoid API costs
- Need full control over data
- Experiment with local AI workflows
Limitations
- Slower than cloud models
- Limited context compared to frontier models
- Requires tuning for best results
Conclusion
Combining Ollama + Qwen3.6 35B + VS Code gives you a powerful local AI coding assistant.
While not a full replacement for cloud-based models, it is:
- Private
- Flexible
- Cost-efficient
And increasingly practical with modern hardware.
Next Steps
- Add RAG (Retrieval-Augmented Generation)
- Integrate with your local codebase
- Experiment with fine-tuning or LoRA
- Compare with vLLM-based setups
References
- Ollama documentation
- Qwen model releases
- VS Code extension marketplace