{"id":7621,"date":"2025-10-20T06:54:00","date_gmt":"2025-10-19T21:54:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=7621"},"modified":"2025-10-18T15:36:54","modified_gmt":"2025-10-18T06:36:54","slug":"internvla-m1-a-spatially-guided-vision-language-action-framework-for-generalist-robot-policy","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=7621","title":{"rendered":"InternVLA-M1, Vlaser"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy&nbsp;<\/strong>[138.9]<br>\u7a7a\u9593\u63a5\u5730\u3068\u30ed\u30dc\u30c3\u30c8\u5236\u5fa1\u306e\u305f\u3081\u306e\u7d71\u5408\u30d5\u30ec\u30fc\u30e0\u30ef\u30fc\u30af\u3067\u3042\u308bInternVLA-M1\u3092\u7d39\u4ecb\u3059\u308b\u3002 InternVLA-M1\u306f\u3001(i)2.3M\u4ee5\u4e0a\u306e\u7a7a\u9593\u7684\u63a8\u8ad6\u30c7\u30fc\u30bf\u306b\u57fa\u3065\u304f\u7a7a\u9593\u7684\u30b0\u30e9\u30a6\u30f3\u30c9\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3068(ii)\u7a7a\u9593\u7684\u306b\u8a98\u5c0e\u3055\u308c\u305f\u5f8c\u30c8\u30ec\u30fc\u30cb\u30f3\u30b0\u3068\u3044\u3046\u30012\u6bb5\u968e\u306e\u30d1\u30a4\u30d7\u30e9\u30a4\u30f3\u3092\u4f7f\u7528\u3059\u308b\u3002 \u7d50\u679c: InternVLA-M1 \u306f SimplerEnv Google Robot \u3067+14.6%\u3001WidowX \u3067+17%\u3001LIBERO Franka \u3067+4.3% \u3067\u3001\u7a7a\u9593\u8a98\u5c0e\u306a\u3057\u3067\u305d\u306e\u5909\u7a2e\u3092\u4e0a\u56de\u3063\u305f\u3002<br><a href=\"http:\/\/arxiv.org\/abs\/2510.13778v1\">\u8ad6\u6587<\/a>&nbsp;&nbsp;<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2510.13778v1\">\u53c2\u8003\u8a33\uff08\u30e1\u30bf\u30c7\u30fc\u30bf\uff09<\/a>&nbsp; &nbsp;(Wed, 15 Oct 2025 17:30:05 GMT)<\/li>\n\n\n\n<li>Shanghai AI Laboratory\u306b\u3088\u308bVLA\u30d5\u30ec\u30fc\u30e0\u30ef\u30fc\u30af\u3001\u300cOn SimplerEnv (Google Robot and WidowX), InternVLA-M1 achieves a new state-of-the-art, surpassing its variant by improving the average success rate by up to +5.9% and +9.8%, respectively. It also demonstrates strong spatial reasoning capabilities across box, point, and trace prediction tasks.\u300d\u3002<\/li>\n\n\n\n<li>\u30a2\u30fc\u30ad\u30c6\u30af\u30c1\u30e3\u306f\u300cInternVLA-M1 employs the Qwen2.5-VL- 3B-instruct Bai et al (2025a) as the multimodal encoder for System 2, which is to capture spatial priors. It adopts the diffusion policy Chi et al (2023) (86 M) as the Action Expert (System 1, the fast executor), which effectively models embodiment-specific control. This expert is built on the DINOv2 visual encoder Oquab et al (2023) (21 M) and a lightweight state encoder (0.4 M), forming a compact vision\u2013action model. In total, InternVLA-M1 comprises approximately 4.1B parameters.\u300d\u3068\u516c\u958b\u30e2\u30c7\u30eb\u306e\u610f\u7fa9\u3092\u611f\u3058\u308b\u69cb\u6210\u3002spatial prompting\u3092\u30b3\u30a2\u3068\u3057\u3066System2 \u2192 System1\u3092\u6d3b\u7528\u3059\u308b\u69cb\u6210\u3002<\/li>\n\n\n\n<li>\u300cTo bridge the gap between VLM and VLA, we introduce a Post-Pre-Training phase, where large-scale simulated data is used to pre-train the VLA after VLM pre-training. 
  This stage initializes the action head and facilitates the learning of action representations."
- Repository: [GitHub – InternRobotics/InternVLA-M1](https://github.com/InternRobotics/InternVLA-M1) (InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy)
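To make the System 2 → System 1 hand-off above concrete, below is a minimal, self-contained PyTorch-style sketch of a single inference step: a slow spatial-prompting head (standing in for the Qwen2.5-VL "System 2") emits a few conditioning tokens, and a fast diffusion-policy action expert (standing in for "System 1" with its DINOv2-style visual encoder and lightweight state encoder) iteratively denoises an action chunk under that conditioning. All class names, dimensions, and the toy denoising loop are illustrative assumptions, not the InternVLA-M1 repository's API.

```python
# Hypothetical sketch of a System 2 -> System 1 spatially guided VLA step.
# Not InternVLA-M1 code: modules, sizes, and the denoising loop are toy stand-ins.
import torch
import torch.nn as nn


class System2SpatialPrompter(nn.Module):
    """Stand-in for the Qwen2.5-VL-3B 'System 2': maps an image plus an
    instruction embedding to K spatial prompt tokens (box/point-style latents)."""
    def __init__(self, d_model: int = 256, num_prompts: int = 8):
        super().__init__()
        self.img_proj = nn.Linear(3 * 32 * 32, d_model)   # toy image features
        self.txt_proj = nn.Linear(128, d_model)           # toy text features
        self.prompt_head = nn.Linear(d_model, num_prompts * d_model)
        self.num_prompts, self.d_model = num_prompts, d_model

    def forward(self, image, text_emb):
        fused = self.img_proj(image.flatten(1)) + self.txt_proj(text_emb)
        return self.prompt_head(fused).view(-1, self.num_prompts, self.d_model)


class System1ActionExpert(nn.Module):
    """Stand-in for the ~86M diffusion-policy Action Expert: predicts the noise
    on an action chunk, conditioned on visual, state, and spatial-prompt features."""
    def __init__(self, d_model: int = 256, action_dim: int = 7, horizon: int = 8):
        super().__init__()
        self.vis_enc = nn.Linear(3 * 32 * 32, d_model)    # DINOv2 stand-in
        self.state_enc = nn.Linear(14, d_model)           # lightweight state encoder
        self.denoiser = nn.Sequential(
            nn.Linear(horizon * action_dim + 3 * d_model, 512), nn.GELU(),
            nn.Linear(512, horizon * action_dim))
        self.action_dim, self.horizon = action_dim, horizon

    def forward(self, noisy_actions, image, state, prompts):
        cond = torch.cat([self.vis_enc(image.flatten(1)),
                          self.state_enc(state),
                          prompts.mean(dim=1)], dim=-1)
        x = torch.cat([noisy_actions.flatten(1), cond], dim=-1)
        return self.denoiser(x).view(-1, self.horizon, self.action_dim)


@torch.no_grad()
def act(system2, system1, image, text_emb, state, steps: int = 4):
    """System 2 runs once per instruction; System 1 iterates quickly to refine actions."""
    prompts = system2(image, text_emb)                      # spatial prompting
    actions = torch.randn(image.size(0), system1.horizon, system1.action_dim)
    for _ in range(steps):                                  # crude iterative denoising
        predicted_noise = system1(actions, image, state, prompts)
        actions = actions - 0.5 * predicted_noise
    return actions


if __name__ == "__main__":
    s2, s1 = System2SpatialPrompter(), System1ActionExpert()
    img, txt = torch.randn(2, 3, 32, 32), torch.randn(2, 128)
    state = torch.randn(2, 14)
    print(act(s2, s1, img, txt, state).shape)  # torch.Size([2, 8, 7])
```

Read against the quoted Post-Pre-Training description, the action head of such a System 1 module is what the large-scale simulated data would initialize before the spatially guided post-training; the sketch only illustrates the inference-time hand-off.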
- **Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning** [124.5]
  We introduce Vlaser, a vision-language-action model with synergistic embodied reasoning capability. Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks. The proposed method achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
  [Paper](http://arxiv.org/abs/2510.11027v1)  [Reference translation (metadata)](https://fugumt.com/fugumt/paper_check/2510.11027v1)  (Mon, 13 Oct 2025 05:51:22 GMT)
- This one is built on InternVL3 and stresses the importance of data: "In this work, we reveal that current embodied reasoning benchmarks exhibit a significant domain gap when compared to real-world robots. This core domain shift arises from the observation that robots have a fundamentally different viewpoint from that of internet datasets."
- Repository: [GitHub – OpenGVLab/Vlaser](https://github.com/OpenGVLab/Vlaser/) (Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning)