<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Llama.cpp on Grayrecord Technow Blog</title>
    <link>https://technow.grayrecord.com/tags/llama.cpp/</link>
    <description>Recent content in Llama.cpp on Grayrecord Technow Blog</description>
    <image>
      <title>Grayrecord Technow Blog</title>
      <url>https://technow.grayrecord.com/images/Grayrecord-technow.png</url>
      <link>https://technow.grayrecord.com/images/Grayrecord-technow.png</link>
    </image>
    <generator>Hugo -- 0.160.1</generator>
    <language>ja</language>
    <lastBuildDate>Tue, 19 Aug 2025 18:00:00 +0900</lastBuildDate>
    <atom:link href="https://technow.grayrecord.com/tags/llama.cpp/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Running Gemma 3 270M on a Raspberry Pi 400</title>
      <link>https://technow.grayrecord.com/post/gemma3-270-on-raspberrypi400/</link>
      <pubDate>Tue, 19 Aug 2025 18:00:00 +0900</pubDate>
      <guid>https://technow.grayrecord.com/post/gemma3-270-on-raspberrypi400/</guid>
      <description>&lt;p&gt;&lt;img loading=&#34;lazy&#34; src=&#34;https://technow.grayrecord.com/images/gemma_on_raspberry.jpg&#34;&gt;&lt;/p&gt;
&lt;p&gt;It has been confirmed that the Gemma 3 270M model runs on a Raspberry Pi 400.&lt;/p&gt;
&lt;h2 id=&#34;ビルドと実行方法&#34;&gt;How to Build and Run&lt;/h2&gt;
&lt;ol&gt;
&lt;li&gt;Build command&lt;/li&gt;
&lt;/ol&gt;
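&lt;p&gt;The build assumes a llama.cpp checkout and a C/C++ toolchain. A minimal setup sketch for Raspberry Pi OS (Debian bookworm) might look like the following; the package list is an assumption, not part of the original post.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# assumed toolchain packages for Raspberry Pi OS / Debian bookworm
sudo apt update
sudo apt install -y git cmake build-essential
# fetch the sources; the cmake commands below are run inside this checkout
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
&lt;/code&gt;&lt;/pre&gt;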
&lt;p&gt;Build llama.cpp with the following commands (GGML_NATIVE enables optimizations for the host CPU, and LLAMA_CURL=OFF drops the libcurl dependency).&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake -B build -DGGML_NATIVE&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;ON -DLLAMA_NEON&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;ON -DLLAMA_CURL&lt;span class=&#34;o&#34;&gt;=&lt;/span&gt;OFF
&lt;/span&gt;&lt;/span&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;cmake --build build --config Release -j4
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;ol start=&#34;2&#34;&gt;
&lt;li&gt;Run the benchmark
After building (and downloading the model; see the sketch below), run llama-bench.&lt;/li&gt;
&lt;/ol&gt;
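&lt;p&gt;llama-bench expects the GGUF file under models/. If it is not there yet, it can be fetched from Hugging Face first; the unsloth/gemma-3-270m-it-GGUF repository path below is an assumption inferred from the quantization metadata in the startup log further down.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# assumed download source, inferred from general.quantized_by = Unsloth
mkdir -p models
wget -O models/gemma-3-270m-it-Q2_K.gguf \
  https://huggingface.co/unsloth/gemma-3-270m-it-GGUF/resolve/main/gemma-3-270m-it-Q2_K.gguf
&lt;/code&gt;&lt;/pre&gt;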
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;./llama.cpp/build/bin/llama-bench -m models/gemma-3-270m-it-Q2_K.gguf
&lt;/code&gt;&lt;/pre&gt;&lt;p&gt;The benchmark results are as follows.&lt;/p&gt;
&lt;table&gt;
  &lt;thead&gt;
      &lt;tr&gt;
          &lt;th&gt;model&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;size&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;params&lt;/th&gt;
          &lt;th&gt;backend&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;threads&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;test&lt;/th&gt;
          &lt;th style=&#34;text-align: right&#34;&gt;t/s&lt;/th&gt;
      &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
      &lt;tr&gt;
          &lt;td&gt;gemma3 270M Q2_K - Medium&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;219.87 MiB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;268.10 M&lt;/td&gt;
          &lt;td&gt;CPU&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;pp512&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;47.36 ± 0.04&lt;/td&gt;
      &lt;/tr&gt;
      &lt;tr&gt;
          &lt;td&gt;gemma3 270M Q2_K - Medium&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;219.87 MiB&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;268.10 M&lt;/td&gt;
          &lt;td&gt;CPU&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;4&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;tg128&lt;/td&gt;
          &lt;td style=&#34;text-align: right&#34;&gt;2.48 ± 0.00&lt;/td&gt;
      &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;
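&lt;p&gt;Here pp512 is llama-bench's prompt-processing test (a 512-token prompt) and tg128 is token generation (128 new tokens): the Pi 400 ingests prompts at roughly 47 tokens per second but generates at only about 2.5 tokens per second.&lt;/p&gt;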
&lt;ol start=&#34;3&#34;&gt;
&lt;li&gt;Run and test interactive mode&lt;/li&gt;
&lt;/ol&gt;
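&lt;p&gt;Note that llama-cli defaults to a 4096-token context even though the model was trained with 32768 (both values are visible in the startup log below). A larger window can be requested with -c; the invocation below is a sketch, not from the original post, and assumes enough free memory on the Pi.&lt;/p&gt;
&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;# -c raises the context window, -t sets the CPU thread count
./llama.cpp/build/bin/llama-cli -m models/gemma-3-270m-it-Q2_K.gguf -c 8192 -t 4
&lt;/code&gt;&lt;/pre&gt;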
&lt;p&gt;Run interactive mode with the following command.&lt;/p&gt;
&lt;div class=&#34;highlight&#34;&gt;&lt;pre tabindex=&#34;0&#34; class=&#34;chroma&#34;&gt;&lt;code class=&#34;language-bash&#34; data-lang=&#34;bash&#34;&gt;&lt;span class=&#34;line&#34;&gt;&lt;span class=&#34;cl&#34;&gt;./llama.cpp/build/bin/llama-cli -m models/gemma-3-270m-it-Q2_K.gguf
&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;pre tabindex=&#34;0&#34;&gt;&lt;code&gt;build: 6201 (9d262f4b) with cc (Debian 12.2.0-14+deb12u1) 12.2.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 45 key-value pairs and 236 tensors from models/gemma-3-270m-it-Q2_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = gemma3
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Gemma-3-270M-It
llama_model_loader: - kv   3:                           general.finetune str              = it
llama_model_loader: - kv   4:                           general.basename str              = Gemma-3-270M-It
llama_model_loader: - kv   5:                       general.quantized_by str              = Unsloth
llama_model_loader: - kv   6:                         general.size_label str              = 270M
llama_model_loader: - kv   7:                            general.license str              = gemma
llama_model_loader: - kv   8:                           general.repo_url str              = https://huggingface.co/unsloth
llama_model_loader: - kv   9:                   general.base_model.count u32              = 1
llama_model_loader: - kv  10:                  general.base_model.0.name str              = Gemma 3 270m It
llama_model_loader: - kv  11:          general.base_model.0.organization str              = Gg Hf Gm
llama_model_loader: - kv  12:              general.base_model.0.repo_url str              = https://huggingface.co/gg-hf-gm/gemma...
llama_model_loader: - kv  13:                               general.tags arr[str,5]       = [&#34;gemma3&#34;, &#34;unsloth&#34;, &#34;gemma&#34;, &#34;googl...
llama_model_loader: - kv  14:                      gemma3.context_length u32              = 32768
llama_model_loader: - kv  15:                    gemma3.embedding_length u32              = 640
llama_model_loader: - kv  16:                         gemma3.block_count u32              = 18
llama_model_loader: - kv  17:                 gemma3.feed_forward_length u32              = 2048
llama_model_loader: - kv  18:                gemma3.attention.head_count u32              = 4
llama_model_loader: - kv  19:    gemma3.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  20:                gemma3.attention.key_length u32              = 256
llama_model_loader: - kv  21:              gemma3.attention.value_length u32              = 256
llama_model_loader: - kv  22:                      gemma3.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  23:            gemma3.attention.sliding_window u32              = 512
llama_model_loader: - kv  24:             gemma3.attention.head_count_kv u32              = 1
llama_model_loader: - kv  25:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  26:                         tokenizer.ggml.pre str              = default
llama_model_loader: - kv  27:                      tokenizer.ggml.tokens arr[str,262144]  = [&#34;&amp;lt;pad&amp;gt;&#34;, &#34;&amp;lt;eos&amp;gt;&#34;, &#34;&amp;lt;bos&amp;gt;&#34;, &#34;&amp;lt;unk&amp;gt;&#34;, ...
llama_model_loader: - kv  28:                      tokenizer.ggml.scores arr[f32,262144]  = [-1000.000000, -1000.000000, -1000.00...
llama_model_loader: - kv  29:                  tokenizer.ggml.token_type arr[i32,262144]  = [3, 3, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv  30:                tokenizer.ggml.bos_token_id u32              = 2
llama_model_loader: - kv  31:                tokenizer.ggml.eos_token_id u32              = 106
llama_model_loader: - kv  32:            tokenizer.ggml.unknown_token_id u32              = 3
llama_model_loader: - kv  33:            tokenizer.ggml.padding_token_id u32              = 0
llama_model_loader: - kv  34:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  35:               tokenizer.ggml.add_sep_token bool             = false
llama_model_loader: - kv  36:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  37:                    tokenizer.chat_template str              = {# Unsloth Chat template fixes #}\n{{ ...
llama_model_loader: - kv  38:            tokenizer.ggml.add_space_prefix bool             = false
llama_model_loader: - kv  39:               general.quantization_version u32              = 2
llama_model_loader: - kv  40:                          general.file_type u32              = 10
llama_model_loader: - kv  41:                      quantize.imatrix.file str              = gemma-3-270m-it-GGUF/imatrix_unsloth....
llama_model_loader: - kv  42:                   quantize.imatrix.dataset str              = unsloth_calibration_gemma-3-270m-it.txt
llama_model_loader: - kv  43:             quantize.imatrix.entries_count u32              = 126
llama_model_loader: - kv  44:              quantize.imatrix.chunks_count u32              = 141
llama_model_loader: - type  f32:  109 tensors
llama_model_loader: - type q5_0:   18 tensors
llama_model_loader: - type q8_0:    1 tensors
llama_model_loader: - type q3_K:   36 tensors
llama_model_loader: - type iq4_nl:   72 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q2_K - Medium
print_info: file size   = 219.87 MiB (6.88 BPW)
load: printing all EOG tokens:
load:   - 106 ('&amp;lt;end_of_turn&amp;gt;')
load: special tokens cache size = 6414
load: token to piece cache size = 1.9446 MB
print_info: arch             = gemma3
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 640
print_info: n_layer          = 18
print_info: n_head           = 4
print_info: n_head_kv        = 1
print_info: n_rot            = 256
print_info: n_swa            = 512
print_info: is_swa_any       = 1
print_info: n_embd_head_k    = 256
print_info: n_embd_head_v    = 256
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 6.2e-02
print_info: n_ff             = 2048
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = ?B
print_info: model params     = 268.10 M
print_info: general.name     = Gemma-3-270M-It
print_info: vocab type       = SPM
print_info: n_vocab          = 262144
print_info: n_merges         = 0
print_info: BOS token        = 2 '&amp;lt;bos&amp;gt;'
print_info: EOS token        = 106 '&amp;lt;end_of_turn&amp;gt;'
print_info: EOT token        = 106 '&amp;lt;end_of_turn&amp;gt;'
print_info: UNK token        = 3 '&amp;lt;unk&amp;gt;'
print_info: PAD token        = 0 '&amp;lt;pad&amp;gt;'
print_info: LF token         = 248 '&amp;lt;0x0A&amp;gt;'
print_info: EOG token        = 106 '&amp;lt;end_of_turn&amp;gt;'
print_info: max token length = 48
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors:   CPU_Mapped model buffer size =   219.87 MiB
........................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) &amp;lt; n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:        CPU  output buffer size =     1.00 MiB
llama_kv_cache_unified_iswa: creating non-SWA KV cache, size = 4096 cells
llama_kv_cache_unified:        CPU KV buffer size =    12.00 MiB
llama_kv_cache_unified: size =   12.00 MiB (  4096 cells,   3 layers,  1/1 seqs), K (f16):    6.00 MiB, V (f16):    6.00 MiB
llama_kv_cache_unified_iswa: creating     SWA KV cache, size = 1024 cells
llama_kv_cache_unified:        CPU KV buffer size =    15.00 MiB
llama_kv_cache_unified: size =   15.00 MiB (  1024 cells,  15 layers,  1/1 seqs), K (f16):    7.50 MiB, V (f16):    7.50 MiB
llama_context:        CPU compute buffer size =   513.25 MiB
llama_context: graph nodes  = 799
llama_context: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this context, disabling KV cache shifting
common_init_from_params: added &amp;lt;end_of_turn&amp;gt; logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 4
main: chat template is available, enabling conversation mode (disable it with -no-cnv)
main: chat template example:
&amp;lt;start_of_turn&amp;gt;user
You are a helpful assistant&lt;/code&gt;&lt;/pre&gt;</description>
    </item>
  </channel>
</rss>
