Best practices for deploying DeepSeek-OCR models


This document shows you how to deploy a DeepSeek-OCR model from development to production on the FunModel platform. Using the DevPod cloud development environment, you will learn how to develop, debug, package, and deploy DeepSeek-OCR model services. This process enables seamless collaboration between development and Operations and Maintenance (O&M).

Prerequisites

Basic requirements

  1. Have an Alibaba Cloud account.

  2. Log on to the FunModel console and follow the on-screen instructions to complete configurations, such as RAM role authorization.

    Important: If you are using the old console, click the "New Console" button in the upper-right corner to switch to the new console.

  3. Technical knowledge: You must have a basic understanding of Python and deep learning model deployment.

Environment preparation

Complete the environment setup and basic tests in the DeepSeek-OCR QuickStart guide to familiarize yourself with basic DevPod operations.

Development and debugging

This section describes how to develop a production-grade DeepSeek-OCR model service in a DevPod environment.

DevPod environment advantages

The DeepSeek-OCR DevPod provides the following:

  • Pre-configured environment: The environment includes pre-installed deep learning frameworks, such as PyTorch, vLLM, and Transformers.

  • GPU resources: Ready-to-use GPU computing power without requiring local configuration.

  • Persistent storage: The NAS mount path /mnt/{model_name} automatically saves model files.

  • Unified workspace: You can develop, debug, and deploy in the same environment to eliminate discrepancies.

Developing the model service

Why wrap the model as a service?

The official DeepSeek-OCR repository provides command-line scripts that are suitable for local testing but not for direct use in a production environment. To integrate Optical Character Recognition (OCR) capabilities into your business systems, you must expose the model as an HTTP API service. The following table compares the two approaches:

| Comparison dimension | Command-line script | HTTP service |
| --- | --- | --- |
| Access method | Requires logging on to a server to execute | Remote calls through an HTTP API |
| Business integration | Difficult to integrate into business systems | Can be called by any language through HTTP |
| Concurrency capability | Single-process serial processing | Supports concurrent requests from multiple users |
| Extensibility | Difficult to scale out horizontally | Can deploy multiple instances for load balancing |

Create the server-side code

In DevPod, create a server.py file to encapsulate the model as a FastAPI service:

# Execute in the DevPod terminal
cd /workspace/DeepSeek-OCR/DeepSeek-OCR-master/DeepSeek-OCR-vllm
touch server.py

Then, open server.py in the Web IDE and write the service code.

Core code analysis

The following is the core implementation of a production-grade DeepSeek-OCR service that supports batch processing of images and PDFs:

1. Model initialization and configuration

# Configure environment variables
if torch.version.cuda == '11.8':
    os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda-11.8/bin/ptxas"
os.environ['VLLM_USE_V1'] = '0'
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

# Register and load the model
ModelRegistry.register_model("DeepseekOCRForCausalLM", DeepseekOCRForCausalLM)

llm = LLM(
    model=MODEL_PATH,
    hf_overrides={"architectures": ["DeepseekOCRForCausalLM"]},
    block_size=256,
    max_model_len=8192,
    max_num_seqs=max(MAX_CONCURRENCY, 100),
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory
    disable_mm_preprocessor_cache=True
)
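
The snippet above references MODEL_PATH and MAX_CONCURRENCY, which the full code imports from a config module together with CROP_MODE and NUM_WORKERS. That module is not shown in this document, so the following is only a minimal sketch of how such settings might be supplied. The environment-variable names, the default values, and the load_settings helper are assumptions made for illustration, not the project's actual configuration.

```python
import os
from typing import Mapping


def load_settings(env: Mapping[str, str] = os.environ) -> dict:
    """Read service settings from environment variables (hypothetical names and defaults)."""
    return {
        # Model location; the NAS mount follows the pattern /mnt/{model_name}.
        "MODEL_PATH": env.get("MODEL_PATH", "/mnt/{model_name}"),
        # Passed as the `cropping` argument during image pre-processing (assumed default).
        "CROP_MODE": env.get("CROP_MODE", "true").lower() == "true",
        # Feeds max_num_seqs=max(MAX_CONCURRENCY, 100) when the engine is created (assumed default).
        "MAX_CONCURRENCY": int(env.get("MAX_CONCURRENCY", "100")),
        # Size of the thread pool used for image pre-processing (assumed default).
        "NUM_WORKERS": int(env.get("NUM_WORKERS", "8")),
    }


settings = load_settings({"MAX_CONCURRENCY": "32"})
print(settings["MAX_CONCURRENCY"], settings["NUM_WORKERS"])  # prints: 32 8
```

Reading the values from the environment is one way to change them at deployment time without rebuilding the image; a plain constants file would work equally well.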

2. API interface design

class InputData(BaseModel):
    """Supports mixed input of images and PDFs"""
    images: Optional[List[str]] = None  # List of image URLs
    pdfs: Optional[List[str]] = None    # List of PDF URLs

class RequestData(BaseModel):
    """Request model, supports custom prompts"""
    input: InputData
    prompt: str = '<image>\nFree OCR.'  # Default prompt

class ResponseData(BaseModel):
    """Returns OCR recognition results"""
    output: List[str]

3. Asynchronous concurrent processing

async def process_items_async(items_urls: List[str], is_pdf: bool, prompt: str):
    """
    Asynchronously process a list of image/PDF URLs
    - Concurrently download files
    - Use a thread pool for image pre-processing
    - Return batch inference inputs
    """
    loop = asyncio.get_running_loop()
    
    # Concurrently download all files
    download_tasks = [loop.run_in_executor(None, download_file, url) 
                      for url in items_urls]
    contents = await asyncio.gather(*download_tasks)
    
    # (Omitted here: build processing_args and num_results_per_input from
    # contents; see the full code example.)
    
    # Process images in a thread pool
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
        process_tasks = [
            loop.run_in_executor(executor, process_single_image_sync, img, prompt)
            for img, prompt in processing_args
        ]
        processed_results = await asyncio.gather(*process_tasks)
    
    return processed_results, num_results_per_input
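
The pattern in this excerpt (blocking work handed to thread pools and awaited with asyncio.gather) can be tried in isolation. The following is a minimal, standard-library-only sketch; fake_download and fake_preprocess are hypothetical stand-ins for the real download and image pre-processing functions, and the worker count is an arbitrary assumption.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(url: str) -> bytes:
    """Stand-in for a blocking HTTP download (hypothetical helper)."""
    time.sleep(0.05)
    return url.encode()


def fake_preprocess(content: bytes) -> str:
    """Stand-in for image pre-processing (hypothetical helper)."""
    return content.decode().upper()


async def fetch_and_preprocess(urls: list[str], workers: int = 4) -> list[str]:
    loop = asyncio.get_running_loop()
    # Blocking downloads run in the default thread pool, so the event loop stays free.
    contents = await asyncio.gather(
        *(loop.run_in_executor(None, fake_download, u) for u in urls)
    )
    # Pre-processing runs in a dedicated pool with a bounded number of workers.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        processed = await asyncio.gather(
            *(loop.run_in_executor(pool, fake_preprocess, c) for c in contents)
        )
    return list(processed)


results = asyncio.run(fetch_and_preprocess(["a.png", "b.png", "c.png"]))
print(results)  # prints: ['A.PNG', 'B.PNG', 'C.PNG']
```

asyncio.gather returns results in the order the tasks were passed in, not the order they finish, which is what allows the service to map outputs back to the input URLs.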

4. Batch inference interface

@app.post("/ocr_batch", response_model=ResponseData)
async def ocr_batch_inference(request: RequestData):
    """
    Batch OCR processing interface
    - Supports mixed input of images and PDFs
    - Automatically handles multi-page PDF scenarios
    - Returns structured recognition results
    """
    # Process images and PDFs
    all_batch_inputs = []
    if request.input.images:
        batch_inputs_images, counts_images = await process_items_async(
            request.input.images, is_pdf=False, prompt=request.prompt
        )
        all_batch_inputs.extend(batch_inputs_images)
    
    if request.input.pdfs:
        batch_inputs_pdfs, counts_pdfs = await process_items_async(
            request.input.pdfs, is_pdf=True, prompt=request.prompt
        )
        all_batch_inputs.extend(batch_inputs_pdfs)
    
    # Batch inference
    outputs_list = await run_inference(all_batch_inputs)
    
    # Reorganize results (merge multi-page PDFs); the construction of
    # final_outputs is omitted here and shown in the full code example
    return ResponseData(output=final_outputs)
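
The result reorganization step omitted from this excerpt can be sketched on plain strings. The example below is a simplified illustration that mirrors the logic of the full code (per-input page counts, end-of-sentence token removal, and the page split marker); the regroup_outputs name is hypothetical, and the real service reads the text from the vLLM output objects rather than from a list of strings.

```python
PAGE_SPLIT = "\n<--- Page Split --->\n"
EOS = "<|end▁of▁sentence|>"


def regroup_outputs(texts: list[str], counts: list[int]) -> list[str]:
    """Merge flat per-page texts back into one string per input document."""
    merged = []
    start = 0
    for count in counts:
        # Take this document's pages and strip the end-of-sentence token.
        pages = [t.replace(EOS, "") for t in texts[start:start + count]]
        # A single page joins to itself; an empty document yields "".
        merged.append(PAGE_SPLIT.join(pages))
        start += count
    return merged


flat = ["image text", "pdf page 1" + EOS, "pdf page 2"]
print(regroup_outputs(flat, [1, 2]))
# prints: ['image text', 'pdf page 1\n<--- Page Split --->\npdf page 2']
```

A client that needs individual pages can split each returned string on the same marker.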

Full code example

import os
import io
import torch
import uvicorn
import requests
from PIL import Image
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from typing import Optional, Dict, Any, List
import tempfile
import fitz
from concurrent.futures import ThreadPoolExecutor
import asyncio

# Set environment variables
if torch.version.cuda == '11.8':
    os.environ["TRITON_PTXAS_PATH"] = "/usr/local/cuda-11.8/bin/ptxas"
os.environ['VLLM_USE_V1'] = '0'
os.environ["CUDA_VISIBLE_DEVICES"] = '0'

from config import MODEL_PATH, CROP_MODE, MAX_CONCURRENCY, NUM_WORKERS
from vllm import LLM, SamplingParams
from vllm.model_executor.models.registry import ModelRegistry
from deepseek_ocr import DeepseekOCRForCausalLM
from process.ngram_norepeat import NoRepeatNGramLogitsProcessor
from process.image_process import DeepseekOCRProcessor

# Register model
ModelRegistry.register_model("DeepseekOCRForCausalLM", DeepseekOCRForCausalLM)

# Initialize model
print("Loading model...")
llm = LLM(
    model=MODEL_PATH,
    hf_overrides={"architectures": ["DeepseekOCRForCausalLM"]},
    block_size=256,           # Memory block size for KV cache
    enforce_eager=False,      # Allow CUDA graph capture; set to True to force eager mode
    trust_remote_code=True,   # Allow execution of code from remote repositories
    max_model_len=8192,       # Maximum sequence length the model can handle
    swap_space=0,             # No swapping to CPU, keeping everything on GPU
    max_num_seqs=max(MAX_CONCURRENCY, 100),  # Maximum number of sequences to process concurrently
    tensor_parallel_size=1,   # Number of GPUs for tensor parallelism (1 = single GPU)
    gpu_memory_utilization=0.9,  # Use 90% of GPU memory for model execution
    disable_mm_preprocessor_cache=True  # Disable cache for multimodal preprocessor to avoid issues
)

# Configure sampling parameters
# NoRepeatNGramLogitsProcessor prevents repetition in generated text by tracking n-gram patterns
logits_processors = [NoRepeatNGramLogitsProcessor(ngram_size=20, window_size=50, whitelist_token_ids={128821, 128822})]
sampling_params = SamplingParams(
    temperature=0.0,                    # Deterministic output (greedy decoding)
    max_tokens=8192,                    # Maximum number of tokens to generate
    logits_processors=logits_processors, # Apply the processor to avoid repetitive text
    skip_special_tokens=False,          # Include special tokens in the output
    include_stop_str_in_output=True,    # Include stop strings in the output
)

# Initialize FastAPI app
app = FastAPI(title="DeepSeek-OCR API", version="1.0.0")

class InputData(BaseModel):
    """
    Input data model to define what types of documents to process
    images: Optional list of image URLs to process
    pdfs: Optional list of PDF URLs to process
    Note: At least one of these fields must be provided in a request
    """
    images: Optional[List[str]] = None
    pdfs: Optional[List[str]] = None

class RequestData(BaseModel):
    """
    Main request model that defines the input data and optional prompt
    """
    input: InputData
    # Add prompt as an optional field with a default value
    prompt: str = '<image>\nFree OCR.' # Default prompt

class ResponseData(BaseModel):
    """
    Response model that returns OCR results for each input document
    """
    output: List[str]

def download_file(url: str) -> bytes:
    """Download file from URL"""
    try:
        response = requests.get(url, timeout=30)
        response.raise_for_status()
        return response.content
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Failed to download file from URL: {str(e)}")

def is_pdf_file(content: bytes) -> bool:
    """Check if the content is a PDF file"""
    return content.startswith(b'%PDF')

def load_image_from_bytes(image_bytes: bytes) -> Image.Image:
    """Load image from bytes"""
    try:
        image = Image.open(io.BytesIO(image_bytes))
        return image.convert('RGB')
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Failed to load image: {str(e)}")

def pdf_to_images(pdf_bytes: bytes, dpi: int = 144) -> list:
    """Convert PDF to images"""
    try:
        images = []
        pdf_document = fitz.open(stream=pdf_bytes, filetype="pdf")
        zoom = dpi / 72.0
        matrix = fitz.Matrix(zoom, zoom)

        for page_num in range(pdf_document.page_count):
            page = pdf_document[page_num]
            pixmap = page.get_pixmap(matrix=matrix, alpha=False)
            img_data = pixmap.tobytes("png")
            img = Image.open(io.BytesIO(img_data))
            images.append(img.convert('RGB'))

        pdf_document.close()
        return images
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Failed to convert PDF to images: {str(e)}")

def process_single_image_sync(image: Image.Image, prompt: str) -> Dict:
    """Process a single image (synchronous function for CPU-bound work)"""
    try:
        cache_item = {
            "prompt": prompt,
            "multi_modal_data": {
                "image": DeepseekOCRProcessor().tokenize_with_images(
                    images=[image],
                    bos=True,
                    eos=True,
                    cropping=CROP_MODE
                )
            },
        }
        return cache_item
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Failed to process image: {str(e)}")

async def process_items_async(items_urls: List[str], is_pdf: bool, prompt: str) -> tuple[List[Dict], List[int]]:
    """
    Process a list of image or PDF URLs asynchronously.
    Downloads files concurrently, then processes images/PDF pages in a thread pool.
    Returns a tuple: (batch_inputs, num_results_per_input)
    """
    loop = asyncio.get_running_loop()

    # 1. Download all files concurrently
    download_tasks = [loop.run_in_executor(None, download_file, url) for url in items_urls]
    contents = await asyncio.gather(*download_tasks)

    # 2. Prepare arguments for processing (determine if PDF/image, count pages)
    processing_args = []
    num_results_per_input = []
    for idx, (url, content) in enumerate(zip(items_urls, contents)):
        if is_pdf:
            if not is_pdf_file(content):
                raise HTTPException(status_code=400, detail=f"Provided file is not a PDF: {url}")
            images = pdf_to_images(content)
            num_pages = len(images)
            num_results_per_input.append(num_pages)
            # Each page will be processed separately
            processing_args.extend([(img, prompt) for img in images])
        else: # is image
            if is_pdf_file(content):
                # Handle case where an image URL accidentally points to a PDF
                images = pdf_to_images(content)
                num_pages = len(images)
                num_results_per_input.append(num_pages)
                processing_args.extend([(img, prompt) for img in images])
            else:
                image = load_image_from_bytes(content)
                num_results_per_input.append(1)
                processing_args.append((image, prompt))

    # 3. Process images/PDF pages in parallel using ThreadPoolExecutor
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as executor:
        # Submit all processing tasks
        process_tasks = [
            loop.run_in_executor(executor, process_single_image_sync, img, prompt)
            for img, prompt in processing_args
        ]
        # Wait for all to complete
        processed_results = await asyncio.gather(*process_tasks)

    return processed_results, num_results_per_input

async def run_inference(batch_inputs: List[Dict]) -> List:
    """Run inference on batch inputs.

    Note: llm.generate is a blocking call, so the event loop is held until the batch finishes.
    """
    if not batch_inputs:
        return []
    try:
        # Run inference on the entire batch
        outputs_list = llm.generate(
            batch_inputs,
            sampling_params=sampling_params
        )
        return outputs_list
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Failed to run inference: {str(e)}")

@app.post("/ocr_batch", response_model=ResponseData)
async def ocr_batch_inference(request: RequestData):
    """
    Main OCR batch processing endpoint
    Accepts a list of image URLs and/or PDF URLs for OCR processing
    Returns a list of OCR results corresponding to each input document
    Supports both individual image processing and PDF-to-image conversion
    """
    print(f"Received request data: {request}")
    try:
        input_data = request.input
        prompt = request.prompt # Get the prompt from the request
        if not input_data.images and not input_data.pdfs:
            raise HTTPException(status_code=400, detail="Either 'images' or 'pdfs' (or both) must be provided as lists.")

        all_batch_inputs = []
        final_output_parts = []

        # Process images if provided
        if input_data.images:
            batch_inputs_images, counts_images = await process_items_async(input_data.images, is_pdf=False, prompt=prompt)
            all_batch_inputs.extend(batch_inputs_images)
            final_output_parts.append(counts_images)

        # Process PDFs if provided
        if input_data.pdfs:
            batch_inputs_pdfs, counts_pdfs = await process_items_async(input_data.pdfs, is_pdf=True, prompt=prompt)
            all_batch_inputs.extend(batch_inputs_pdfs)
            final_output_parts.append(counts_pdfs)

        if not all_batch_inputs:
            raise HTTPException(status_code=400, detail="No valid images or PDF pages were processed from the input URLs.")

        # Run inference on the combined batch
        outputs_list = await run_inference(all_batch_inputs)

        # Reconstruct final output list based on counts
        final_outputs = []
        output_idx = 0
        # Flatten the counts list
        all_counts = [count for sublist in final_output_parts for count in sublist]

        for count in all_counts:
            # Get 'count' number of outputs for this input
            input_outputs = outputs_list[output_idx : output_idx + count]
            output_texts = []
            for output in input_outputs:
                content = output.outputs[0].text
                if '<|end▁of▁sentence|>' in content:
                    content = content.replace('<|end▁of▁sentence|>', '')
                output_texts.append(content)

            # Combine pages if it was a multi-page PDF input (or image treated as PDF)
            if count > 1:
                combined_text = "\n<--- Page Split --->\n".join(output_texts)
                final_outputs.append(combined_text)
            else:
                # Single image or single-page PDF
                final_outputs.append(output_texts[0] if output_texts else "")

            output_idx += count # Move to the next set of outputs

        return ResponseData(output=final_outputs)

    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Internal server error: {str(e)}")


@app.get("/health")
async def health_check():
    """Health check endpoint"""
    return {"status": "healthy"}

@app.get("/")
async def root():
    """Root endpoint"""
    return {"message": "DeepSeek-OCR API is running (Batch endpoint available at /ocr_batch)"}

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000, workers=1)

Local testing

Start the service

In the DevPod terminal, start the inference service:

cd /workspace/DeepSeek-OCR/DeepSeek-OCR-master/DeepSeek-OCR-vllm
python server.py

After the model finishes loading, the service listens on port 8000 and is available at http://127.0.0.1:8000.
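
Because server.py loads the model before uvicorn starts, the port does not accept connections until loading completes. The following is a minimal sketch of a readiness check that polls the /health endpoint defined in server.py. The fetch_health and wait_until_healthy helpers, the attempt count, and the delay are assumptions for illustration; the demo at the end uses a stub instead of a live service so that it runs anywhere.

```python
import json
import time
import urllib.request


def fetch_health(base_url: str, timeout: float = 5.0) -> dict:
    """GET <base_url>/health and return the parsed JSON body."""
    url = base_url.rstrip("/") + "/health"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read())


def wait_until_healthy(check, attempts: int = 30, delay_s: float = 10.0) -> bool:
    """Poll check() until it reports {"status": "healthy"} or attempts run out."""
    for attempt in range(attempts):
        try:
            if check().get("status") == "healthy":
                return True
        except (OSError, ValueError):
            pass  # Not reachable yet (URLError is an OSError) or the body was not JSON.
        if attempt < attempts - 1:
            time.sleep(delay_s)
    return False


# Stubbed demo: the first poll is not ready, the second one is.
responses = iter([{"status": "starting"}, {"status": "healthy"}])
print(wait_until_healthy(lambda: next(responses), attempts=3, delay_s=0))  # prints: True
```

Against a running service, the call would look like wait_until_healthy(lambda: fetch_health("http://127.0.0.1:8000")).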

Test a single image

curl -X POST http://127.0.0.1:8000/ocr_batch \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "images": [
        "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png"
      ]
    },
    "prompt": "<image>\n<|grounding|>Convert the document to markdown."
  }'

Test a PDF document

curl -X POST http://127.0.0.1:8000/ocr_batch \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "pdfs": [
        "https://images.devsapp.cn/test/ocr-test.pdf"
      ]
    },
    "prompt": "<image>\nFree OCR."
  }'

Test mixed inputs

curl -X POST http://127.0.0.1:8000/ocr_batch \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "images": [
        "https://paddle-model-ecology.bj.bcebos.com/paddlex/imgs/demo_image/paddleocr_vl_demo.png"
      ],
      "pdfs": [
        "https://images.devsapp.cn/test/ocr-test.pdf"
      ]
    },
    "prompt": "<image>\nFree OCR."
  }'
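
The same requests can be sent from application code. The following is a minimal client sketch that uses only the Python standard library; the build_payload and call_ocr helper names and the 300-second timeout are assumptions for illustration, while the request and response shapes follow the RequestData and ResponseData models in server.py.

```python
import json
import urllib.request


def build_payload(images=None, pdfs=None, prompt="<image>\nFree OCR."):
    """Build the JSON body expected by /ocr_batch."""
    if not images and not pdfs:
        raise ValueError("Provide at least one image or PDF URL")
    input_data = {}
    if images:
        input_data["images"] = list(images)
    if pdfs:
        input_data["pdfs"] = list(pdfs)
    return {"input": input_data, "prompt": prompt}


def call_ocr(base_url: str, payload: dict, timeout: float = 300.0) -> list:
    """POST the payload to <base_url>/ocr_batch and return the 'output' list."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/ocr_batch",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["output"]


payload = build_payload(pdfs=["https://images.devsapp.cn/test/ocr-test.pdf"])
print(json.dumps(payload, indent=2))
# With the service running:
# results = call_ocr("http://127.0.0.1:8000", payload)
```

The response contains one string per input URL, with images first and PDFs second, and the pages of a multi-page PDF joined into a single string.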

Remote debugging

DevPod supports remote debugging through a proxy address, which is convenient for testing with tools such as Postman.

Obtain the proxy address

  1. In the DevPod console, click the Quick Access tab.

  2. Obtain the proxy path. The following is an example:

    https://devpod-e***a-lwt***jyw.cn-hangzhou.ide.fc.aliyun.com/proxy/8000/

Test using the proxy address

The following is an example:

curl -X POST \
  "https://devpod-e***a-lwt***jyw.cn-hangzhou.ide.fc.aliyun.com/proxy/8000/ocr_batch" \
  -H "Content-Type: application/json" \
  -d '{
    "input": {
      "pdfs": ["https://images.devsapp.cn/test/ocr-test.pdf"]
    },
    "prompt": "<image>\nFree OCR."
  }'

Note: You can also use API testing tools, such as Postman or Insomnia, to debug through a graphical user interface for a more intuitive and convenient experience.

Image building and deployment

After the model service is verified in the development environment, you can package it into a container image and deploy it to the production environment.

Build an image

DevPod provides a one-click build feature to package the current development environment into a standard container image:

  1. In the DevPod console, click Create Image.

  2. Select an ACR instance and configure the image information.

  3. The system automatically builds the image and pushes it to the specified container repository.

For more information, see Image Building and ACR Integration.

Deploy the model

After the image is built and pushed, it is stored in ACR. You can then deploy it as a FunModel model service with one click.

  1. After building the image, click Deploy Now.

  2. Configure service parameters, such as the startup command, listener port, and timeout period.

  3. Click Start Deployment to automatically deploy the model service.

Verify the deployment

After the deployment is complete, you can verify the service in the following ways:

  1. Online debugging: Use the Online Debugging feature in the FunModel console to run a quick test.

  2. API call: Obtain the service domain name and call the service through an HTTP client.

  3. Performance testing: Use a stress testing tool to verify concurrent processing capabilities.
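
For the performance testing step, a dedicated stress testing tool is the usual choice, but a quick concurrency check can also be scripted. The following is a minimal sketch under stated assumptions: run_load_test is a hypothetical helper, the request counts are arbitrary, and the demo passes a stub in place of a real HTTP call so that it runs without a deployed service.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_load_test(send_request, total_requests: int = 20, concurrency: int = 4) -> dict:
    """Call send_request() from several threads and summarize latencies in seconds."""

    def timed_call(_):
        start = time.perf_counter()
        ok = True
        try:
            send_request()
        except Exception:
            ok = False  # Count any exception as a failed request.
        return ok, time.perf_counter() - start

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(elapsed for _, elapsed in results)
    # Nearest-rank 95th percentile.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "requests": total_requests,
        "failures": sum(1 for ok, _ in results if not ok),
        "mean_s": mean(latencies),
        "p95_s": latencies[p95_index],
    }


# Stub request so the sketch runs without a deployed service.
report = run_load_test(lambda: time.sleep(0.01), total_requests=8, concurrency=4)
print(report)
```

To test a real deployment, replace the stub with a function that sends one /ocr_batch request to the service domain name and raises an exception on a non-success response.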

Monitoring and iteration

FunModel provides comprehensive monitoring and O&M capabilities:

Monitoring metrics

  • Performance monitoring: View real-time GPU utilization, request latency, and throughput.

  • Log analysis: Logs from all instances are collected centrally and support keyword search and error tracking.

  • Call statistics: View the number of API calls, success rates, and error distribution.

Change management

  • Deployment records: A complete record is kept for every configuration change, such as instance type, timeout period, and scaling policies.

  • Version rollback: You can perform quick rollbacks to stable historical versions.

  • Phased release: You can gradually switch to a new version based on traffic ratios.

Iteration process

When you need to optimize the model or fix issues:

  1. Discover issues: Locate issues through the monitoring dashboard or log analysis.

  2. Develop fixes: Directly modify and test the code in DevPod.

  3. Verify the solution: Fully verify the effectiveness of the fix in the development environment.

  4. Build and deploy: Create a new image and deploy it to the production environment with one click.

  5. Monitor the results: Use monitoring to verify that the issue is resolved.

The entire process is completed in a unified environment. This avoids issues caused by environmental inconsistencies and achieves seamless collaboration between development and O&M.

Best practices summary

Core advantages

The key advantages of using DevPod to deploy the DeepSeek-OCR model service are:

  1. Environment consistency: The development, testing, and production environments are identical, which eliminates environment drift issues.

  2. Resource elasticity: You can allocate GPU resources on demand. Use low-spec instances during development to save costs and scale out as needed in production.

  3. Workflow integration: You can complete all operations in a single workspace without switching between multiple platforms.

  4. Low learning curve: You can focus on business value without first mastering concepts such as Kubernetes or Dockerfiles.

  5. Rapid iteration: The entire cycle from code modification to online verification can be shortened to minutes.

Two workflow patterns

FunModel supports two workflows based on team size and engineering requirements:

DevFlow1: One-click deployment flow (Recommended for individuals and small teams)

This flow is suitable for scenarios that require rapid idea validation and iterative optimization.

Features:

  • No need to write a Dockerfile

  • Complete builds and deployments with one click

  • Suitable for rapid prototype validation

  • Lowers the engineering barrier

Flowchart:

[Figure: DevFlow1 workflow]

Procedure:

  1. Development phase: Write code, install dependencies, and debug features in DevPod.

  2. Build phase: Click Create Image to automatically package the current environment into an image.

  3. Deployment phase: Click Deploy Now to configure the service parameters and publish the service.

  4. Iteration phase: If you discover an issue, fix it directly in DevPod, then rebuild and redeploy.

DevFlow2: Standard engineering flow (Recommended for enterprise teams)

This flow is suitable for teams that pursue engineering standards and long-term maintainability.

Features:

  • Version control for code and configurations

  • Supports multi-person collaboration and code reviews

  • Can be integrated with CI/CD pipelines

  • Traceable and reproducible deployment process

Flowchart:

[Figure: DevFlow2 workflow]

Procedure:

  1. Development preparation: Start DevPod from a specific commit in a Git repository to ensure a consistent baseline.

  2. Feature development: Perform code iteration, dependency installation, and integration testing in DevPod.

  3. Deployment preparation: Write or adjust the Dockerfile to precisely configure the image build logic.

  4. Build and test: The system builds the image according to the Dockerfile and performs end-to-end testing.

  5. Code commit: Commit the code and the Dockerfile to the Git repository together.

  6. Automatic publishing: Automatically build the image and deploy it to the production environment through a CI/CD pipeline.